Artificial Intelligence
Authors and titles for March 2024
Total of 2552 entries :
2-2001
2001-2552
- [2] arXiv:2403.00315 [ pdf , ps , other ]
-
Title: Axe the X in XAI: A Plea for Understandable AISubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: In a recent paper, Erasmus et al. (2021) defend the idea that the ambiguity of the term "explanation" in explainable AI (XAI) can be solved by adopting any of four different extant accounts of explanation in the philosophy of science: the Deductive Nomological, Inductive Statistical, Causal Mechanical, and New Mechanist models. In this chapter, I show that the authors' claim that these accounts can be applied to deep neural networks as they would to any natural phenomenon is mistaken. I also provide a more general argument as to why the notion of explainability as it is currently used in the XAI literature bears little resemblance to the traditional concept of scientific explanation. It would be more fruitful to use the label "understandable AI" to avoid the confusion that surrounds the goal and purposes of XAI. In the second half of the chapter, I argue for a pragmatic conception of understanding that is better suited to play the central role attributed to explanation in XAI. Following Kuorikoski & Ylikoski (2015), the conditions of satisfaction for understanding an ML system are fleshed out in terms of an agent's success in using the system, in drawing correct inferences from it.
- [3] arXiv:2403.00318 [ pdf , ps , html , other ]
-
Title: Deep Reinforcement Learning for Solving Management Problems: Towards A Large Management ModeJinyang Jiang , Xiaotian Liu , Tao Ren , Qinghao Wang , Yi Zheng , Yufu Du , Yijie Peng , Cheng ZhangSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: We introduce a deep reinforcement learning (DRL) approach for solving management problems including inventory management, dynamic pricing, and recommendation. This DRL approach has the potential to lead to a large management model based on certain transformer neural network structures, resulting in an artificial general intelligence paradigm for various management tasks. Traditional methods have limitations for solving complex real-world problems, and we demonstrate how DRL can surpass existing heuristic approaches for solving management tasks. We aim to solve the problems in a unified framework, considering the interconnections between different tasks. Central to our methodology is the development of a foundational decision model coordinating decisions across the different domains through generative decision-making. Our experimental results affirm the effectiveness of our DRL-based framework in complex and dynamic business environments. This work opens new pathways for the application of DRL in management problems, highlighting its potential to revolutionize traditional business management.
- [4] arXiv:2403.00323 [ pdf , ps , html , other ]
-
Title: Softened Symbol Grounding for Neuro-symbolic SystemsComments: Published as a conference paper at ICLR 2023. Code is available at this https URLSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Neuro-symbolic learning generally consists of two separated worlds, i.e., neural network training and symbolic constraint solving, whose success hinges on symbol grounding, a fundamental problem in AI. This paper presents a novel, softened symbol grounding process, bridging the gap between the two worlds, and resulting in an effective and efficient neuro-symbolic learning framework. Technically, the framework features (1) modeling of symbol solution states as a Boltzmann distribution, which avoids expensive state searching and facilitates mutually beneficial interactions between network training and symbolic reasoning;(2) a new MCMC technique leveraging projection and SMT solvers, which efficiently samples from disconnected symbol solution spaces; (3) an annealing mechanism that can escape from %being trapped into sub-optimal symbol groundings. Experiments with three representative neuro symbolic learning tasks demonstrate that, owining to its superior symbol grounding capability, our framework successfully solves problems well beyond the frontier of the existing proposals.
- [5] arXiv:2403.00329 [ pdf , ps , html , other ]
-
Title: Learning with Logical Constraints but without Shortcut SatisfactionComments: Published as a conference paper at ICLR 2023, and code is available at this https URLSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Recent studies in neuro-symbolic learning have explored the integration of logical knowledge into deep learning via encoding logical constraints as an additional loss function. However, existing approaches tend to vacuously satisfy logical constraints through shortcuts, failing to fully exploit the knowledge. In this paper, we present a new framework for learning with logical constraints. Specifically, we address the shortcut satisfaction issue by introducing dual variables for logical connectives, encoding how the constraint is satisfied. We further propose a variational framework where the encoded logical constraint is expressed as a distributional loss that is compatible with the model's original training loss. The theoretical analysis shows that the proposed approach bears salient properties, and the experimental evaluations demonstrate its superior performance in both model generalizability and constraint satisfaction.
- [6] arXiv:2403.00685 [ pdf , ps , html , other ]
-
Title: Know your exceptions: Towards an Ontology of Exceptions in Knowledge RepresentationComments: 18 pages, 4 pages are appendix. (v2 updates: minor revisions on discussions, terminology and text editing)Subjects: Artificial Intelligence (cs.AI)
Abstract: Defeasible reasoning is a kind of reasoning where some generalisations may not be valid in all circumstances, that is general conclusions may fail in some cases. Various formalisms have been developed to model this kind of reasoning, which is characteristic of common-sense contexts. However, it is not easy for a modeller to choose among these systems the one that better fits its domain from an ontological point of view. In this paper we first propose a framework based on the notions of exceptionality and defeasibility in order to be able to compare formalisms and reveal their ontological commitments. Then, we apply this framework to compare four systems, showing the differences that may occur from an ontological perspective.
- [7] arXiv:2403.00690 [ pdf , ps , html , other ]
-
Title: Playing NetHack with LLMs: Potential & Limitations as Zero-Shot AgentsSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have shown great success as high-level planners for zero-shot game-playing agents. However, these agents are primarily evaluated on Minecraft, where long-term planning is relatively straightforward. In contrast, agents tested in dynamic robot environments face limitations due to simplistic environments with only a few objects and interactions. To fill this gap in the literature, we present NetPlay, the first LLM-powered zero-shot agent for the challenging roguelike NetHack. NetHack is a particularly challenging environment due to its diverse set of items and monsters, complex interactions, and many ways to die.
NetPlay uses an architecture designed for dynamic robot environments, modified for NetHack. Like previous approaches, it prompts the LLM to choose from predefined skills and tracks past interactions to enhance decision-making. Given NetHack's unpredictable nature, NetPlay detects important game events to interrupt running skills, enabling it to react to unforeseen circumstances. While NetPlay demonstrates considerable flexibility and proficiency in interacting with NetHack's mechanics, it struggles with ambiguous task descriptions and a lack of explicit feedback. Our findings demonstrate that NetPlay performs best with detailed context information, indicating the necessity for dynamic methods in supplying context information for complex games such as NetHack. - [8] arXiv:2403.00783 [ pdf , ps , html , other ]
-
Title: On the Roles of LLMs in Planning: Embedding LLMs into Planning GraphsSubjects: Artificial Intelligence (cs.AI)
Abstract: Plan synthesis aims to generate a course of actions or policies to transit given initial states to goal states, provided domain models that could be designed by experts or learnt from training data or interactions with the world. Intrigued by the claims of emergent planning capabilities in large language models (LLMs), works have been proposed to investigate the planning effectiveness of LLMs, without considering any utilization of off-the-shelf planning techniques in LLMs. In this paper, we aim to further study the insight of the planning capability of LLMs by investigating the roles of LLMs in off-the-shelf planning frameworks. To do this, we investigate the effectiveness of embedding LLMs into one of the well-known planning frameworks, graph-based planning, proposing a novel LLMs-based planning framework with LLMs embedded in two levels of planning graphs, i.e., mutual constraints generation level and constraints solving level. We empirically exhibit the effectiveness of our proposed framework in various planning domains.
- [9] arXiv:2403.00805 [ pdf , ps , other ]
-
Title: A New Dynamic Distributed Planning Approach: Application to DPDP ProblemsComments: Master's thesis, in French languageSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: In this work, we proposed a new dynamic distributed planning approach that is able to take into account the changes that the agent introduces on his set of actions to be planned in order to take into account the changes that occur in his environment. Our approach fits into the context of distributed planning for distributed plans where each agent can produce its own plans. According to our approach the generation of the plans is based on the satisfaction of the constraints by the use of the genetic algorithms. Our approach is to generate, a new plan by each agent, whenever there is a change in its set of actions to plan. This in order to take into account the new actions introduced in its new plan. In this new plan, the agent takes, each time, as a new action set to plan all the old un-executed actions of the old plan and the new actions engendered by the changes and as a new initial state; the state in which the set of actions of the agent undergoes a change. In our work, we used a concrete case to illustrate and demonstrate the utility of our approach.
- [10] arXiv:2403.00810 [ pdf , ps , html , other ]
-
Title: Bootstrapping Cognitive Agents with a Large Language ModelSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large language models contain noisy general knowledge of the world, yet are hard to train or fine-tune. On the other hand cognitive architectures have excellent interpretability and are flexible to update but require a lot of manual work to instantiate. In this work, we combine the best of both worlds: bootstrapping a cognitive-based model with the noisy knowledge encoded in large language models. Through an embodied agent doing kitchen tasks, we show that our proposed framework yields better efficiency compared to an agent based entirely on large language models. Our experiments indicate that large language models are a good source of information for cognitive architectures, and the cognitive architecture in turn can verify and update the knowledge of large language models to a specific domain.
- [11] arXiv:2403.00811 [ pdf , ps , html , other ]
-
Title: Cognitive Bias in High-Stakes Decision-Making with LLMsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large language models (LLMs) offer significant potential as tools to support an expanding range of decision-making tasks. However, given their training on human (created) data, LLMs can inherit both societal biases against protected groups, as well as be subject to cognitive bias. Such human-like bias can impede fair and explainable decisions made with LLM assistance. Our work introduces BiasBuster, a framework designed to uncover, evaluate, and mitigate cognitive bias in LLMs, particularly in high-stakes decision-making tasks. Inspired by prior research in psychology and cognitive sciences, we develop a dataset containing 16,800 prompts to evaluate different cognitive biases (e.g., prompt-induced, sequential, inherent). We test various bias mitigation strategies, amidst proposing a novel method using LLMs to debias their own prompts. Our analysis provides a comprehensive picture on the presence and effects of cognitive bias across different commercial and open-source models. We demonstrate that our self-help debiasing effectively mitigate cognitive bias without having to manually craft examples for each bias type.
- [12] arXiv:2403.00823 [ pdf , ps , html , other ]
-
Title: Adapting to Teammates in a Cooperative Language GameSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: The game of Codenames has recently emerged as a domain of interest for intelligent agent design. The game is unique due to the way that language and coordination between teammates play important roles. Previous approaches to designing agents for this game have utilized a single internal language model to determine action choices. This often leads to good performance with some teammates and inferior performance with other teammates, as the agent cannot adapt to any specific teammate. In this paper we present the first adaptive agent for playing Codenames. We adopt an ensemble approach with the goal of determining, during the course of interacting with a specific teammate, which of our internal expert agents, each potentially with its own language model, is the best match. One difficulty faced in this approach is the lack of a single numerical metric that accurately captures the performance of a Codenames team. Prior Codenames research has utilized a handful of different metrics to evaluate agent teams. We propose a novel single metric to evaluate the performance of a Codenames team, whether playing a single team (solitaire) game, or a competitive game against another team. We then present and analyze an ensemble agent which selects an internal expert on each turn in order to maximize this proposed metric. Experimental analysis shows that this ensemble approach adapts to individual teammates and often performs nearly as well as the best internal expert with a teammate. Crucially, this success does not depend on any previous knowledge about the teammates, the ensemble agents, or their compatibility. This research represents an important step to making language-based agents for cooperative language settings like Codenames more adaptable to individual teammates.
- [13] arXiv:2403.00829 [ pdf , ps , html , other ]
-
Title: TroubleLLM: Align to Red Team ExpertSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) become the start-of-the-art solutions for a variety of natural language tasks and are integrated into real-world applications. However, LLMs can be potentially harmful in manifesting undesirable safety issues like social biases and toxic content. It is imperative to assess its safety issues before deployment. However, the quality and diversity of test prompts generated by existing methods are still far from satisfactory. Not only are these methods labor-intensive and require large budget costs, but the controllability of test prompt generation is lacking for the specific testing domain of LLM applications. With the idea of LLM for LLM testing, we propose the first LLM, called TroubleLLM, to generate controllable test prompts on LLM safety issues. Extensive experiments and human evaluation illustrate the superiority of TroubleLLM on generation quality and generation controllability.
- [14] arXiv:2403.00830 [ pdf , ps , html , other ]
-
Title: MedAide: Leveraging Large Language Models for On-Premise Medical Assistance on Edge DevicesComments: 7 pages, 11 figures, ACM conference paper, 33 referencesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large language models (LLMs) are revolutionizing various domains with their remarkable natural language processing (NLP) abilities. However, deploying LLMs in resource-constrained edge computing and embedded systems presents significant challenges. Another challenge lies in delivering medical assistance in remote areas with limited healthcare facilities and infrastructure. To address this, we introduce MedAide, an on-premise healthcare chatbot. It leverages tiny-LLMs integrated with LangChain, providing efficient edge-based preliminary medical diagnostics and support. MedAide employs model optimizations for minimal memory footprint and latency on embedded edge devices without server infrastructure. The training process is optimized using low-rank adaptation (LoRA). Additionally, the model is trained on diverse medical datasets, employing reinforcement learning from human feedback (RLHF) to enhance its domain-specific capabilities. The system is implemented on various consumer GPUs and Nvidia Jetson development board. MedAide achieves 77\% accuracy in medical consultations and scores 56 in USMLE benchmark, enabling an energy-efficient healthcare assistance platform that alleviates privacy concerns due to edge-based deployment, thereby empowering the community.
- [15] arXiv:2403.00833 [ pdf , ps , html , other ]
-
Title: Position Paper: Agent AI Towards a Holistic IntelligenceQiuyuan Huang , Naoki Wake , Bidipta Sarkar , Zane Durante , Ran Gong , Rohan Taori , Yusuke Noda , Demetri Terzopoulos , Noboru Kuno , Ade Famoti , Ashley Llorens , John Langford , Hoi Vo , Li Fei-Fei , Katsu Ikeuchi , Jianfeng GaoComments: 22 pages, 4 figures. arXiv admin note: substantial text overlap with arXiv:2401.03568Subjects: Artificial Intelligence (cs.AI)
Abstract: Recent advancements in large foundation models have remarkably enhanced our understanding of sensory information in open-world environments. In leveraging the power of foundation models, it is crucial for AI research to pivot away from excessive reductionism and toward an emphasis on systems that function as cohesive wholes. Specifically, we emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. The emerging field of Agent AI spans a wide range of existing embodied and agent-based multimodal interactions, including robotics, gaming, and healthcare systems, etc. In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model. On top of this idea, we discuss how agent AI exhibits remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Furthermore, we discuss the potential of Agent AI from an interdisciplinary perspective, underscoring AI cognition and consciousness within scientific discourse. We believe that those discussions serve as a basis for future research directions and encourage broader societal engagement.
- [16] arXiv:2403.00839 [ pdf , ps , other ]
-
Title: ToolNet: Connecting Large Language Models with Massive Tools via Tool GraphSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: While achieving remarkable progress in a broad range of tasks, large language models (LLMs) remain significantly limited in properly using massive external tools. Existing in-context learning approaches simply format tools into a list of plain text descriptions and input them to LLMs, from which, LLMs generate a sequence of tool calls to solve problems step by step. Such a paradigm ignores the intrinsic dependency between tools and offloads all reasoning loads to LLMs, making them restricted to a limited number of specifically designed tools. It thus remains challenging for LLMs to operate on a library of massive tools, casting a great limitation when confronted with real-world scenarios. This paper proposes ToolNet, a plug-and-play framework that scales up the number of tools to thousands with a moderate increase in token consumption. ToolNet organizes tools into a directed graph. Each node represents a tool, and weighted edges denote tool transition. Starting from an initial tool node, an LLM navigates in the graph by iteratively choosing the next one from its successors until the task is resolved. Extensive experiments show that ToolNet can achieve impressive results in challenging multi-hop tool learning datasets and is resilient to tool failures.
- [17] arXiv:2403.00859 [ pdf , ps , html , other ]
-
Title: Team Formation amidst ConflictsSubjects: Artificial Intelligence (cs.AI) ; Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI)
Abstract: In this work, we formulate the problem of team formation amidst conflicts. The goal is to assign individuals to tasks, with given capacities, taking into account individuals' task preferences and the conflicts between them. Using dependent rounding schemes as our main toolbox, we provide efficient approximation algorithms. Our framework is extremely versatile and can model many different real-world scenarios as they arise in educational settings and human-resource management. We test and deploy our algorithms on real-world datasets and we show that our algorithms find assignments that are better than those found by natural baselines. In the educational setting we also show how our assignments are far better than those done manually by human experts. In the human resource management application we show how our assignments increase the diversity of teams. Finally, using a synthetic dataset we demonstrate that our algorithms scale very well in practice.
- [18] arXiv:2403.00861 [ pdf , ps , html , other ]
-
Title: Pivoting Retail Supply Chain with Deep Generative Techniques: Taxonomy, Survey and InsightsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Generative AI applications, such as ChatGPT or DALL-E, have shown the world their impressive capabilities in generating human-like text or image. Diving deeper, the science stakeholder for those AI applications are Deep Generative Models, a.k.a DGMs, which are designed to learn the underlying distribution of the data and generate new data points that are statistically similar to the original dataset. One critical question is raised: how can we leverage DGMs into morden retail supply chain realm? To address this question, this paper expects to provide a comprehensive review of DGMs and discuss their existing and potential usecases in retail supply chain, by (1) providing a taxonomy and overview of state-of-the-art DGMs and their variants, (2) reviewing existing DGM applications in retail supply chain from a end-to-end view of point, and (3) discussing insights and potential directions on how DGMs can be further utilized on solving retail supply chain problems.
- [19] arXiv:2403.00898 [ pdf , ps , html , other ]
-
Title: The Algorithm Configuration ProblemJournal-ref: In: Pardalos, P.M., Prokopyev, O.A. (eds) Encyclopedia of Optimization. Springer, Cham. (2023)Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract: The field of algorithmic optimization has significantly advanced with the development of methods for the automatic configuration of algorithmic parameters. This article delves into the Algorithm Configuration Problem, focused on optimizing parametrized algorithms for solving specific instances of decision/optimization problems. We present a comprehensive framework that not only formalizes the Algorithm Configuration Problem, but also outlines different approaches for its resolution, leveraging machine learning models and heuristic strategies. The article categorizes existing methodologies into per-instance and per-problem approaches, distinguishing between offline and online strategies for model construction and deployment. By synthesizing these approaches, we aim to provide a clear pathway for both understanding and addressing the complexities inherent in algorithm configuration.
- [20] arXiv:2403.00980 [ pdf , ps , html , other ]
-
Title: Even-Ifs From If-Onlys: Are the Best Semi-Factual Explanations Found Using Counterfactuals As Guides?Comments: 16 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Recently, counterfactuals using "if-only" explanations have become very popular in eXplainable AI (XAI), as they describe which changes to feature-inputs of a black-box AI system result in changes to a (usually negative) decision-outcome. Even more recently, semi-factuals using "even-if" explanations have gained more attention. They elucidate the feature-input changes that do not change the decision-outcome of the AI system, with a potential to suggest more beneficial recourses. Some semi-factual methods use counterfactuals to the query-instance to guide semi-factual production (so-called counterfactual-guided methods), whereas others do not (so-called counterfactual-free methods). In this work, we perform comprehensive tests of 8 semi-factual methods on 7 datasets using 5 key metrics, to determine whether counterfactual guidance is necessary to find the best semi-factuals. The results of these tests suggests not, but rather that computing other aspects of the decision space lead to better semi-factual XAI.
- [21] arXiv:2403.01199 [ pdf , ps , other ]
-
Title: The Case for Animal-Friendly AIComments: AAAI 2024 Workshop on Public Sector LLMs: Algorithmic and Sociotechnical Design. 12 pages, 11 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Artificial intelligence is seen as increasingly important, and potentially profoundly so, but the fields of AI ethics and AI engineering have not fully recognized that these technologies, including large language models (LLMs), will have massive impacts on animals. We argue that this impact matters, because animals matter morally.
As a first experiment in evaluating animal consideration in LLMs, we constructed a proof-of-concept Evaluation System, which assesses LLM responses and biases from multiple perspectives. This system evaluates LLM outputs by two criteria: their truthfulness, and the degree of consideration they give to the interests of animals. We tested OpenAI ChatGPT 4 and Anthropic Claude 2.1 using a set of structured queries and predefined normative perspectives. Preliminary results suggest that the outcomes of the tested models can be benchmarked regarding the consideration they give to animals, and that generated positions and biases might be addressed and mitigated with more developed and validated systems.
Our research contributes one possible approach to integrating animal ethics in AI, opening pathways for future studies and practical applications in various fields, including education, public policy, and regulation, that involve or relate to animals and society. Overall, this study serves as a step towards more useful and responsible AI systems that better recognize and respect the vital interests and perspectives of all sentient beings. - [22] arXiv:2403.01508 [ pdf , ps , html , other ]
-
Title: Soft Reasoning on Uncertain Knowledge GraphsComments: 10 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: The study of machine learning-based logical query-answering enables reasoning with large-scale and incomplete knowledge graphs. This paper further advances this line of research by considering the uncertainty in the knowledge. The uncertain nature of knowledge is widely observed in the real world, but \textit{does not} align seamlessly with the first-order logic underpinning existing studies. To bridge this gap, we study the setting of soft queries on uncertain knowledge, which is motivated by the establishment of soft constraint programming. We further propose an ML-based approach with both forward inference and backward calibration to answer soft queries on large-scale, incomplete, and uncertain knowledge graphs. Theoretical discussions present that our methods share the same complexity as state-of-the-art inference algorithms for first-order queries. Empirical results justify the superior performance of our approach against previous ML-based methods with number embedding extensions.
- [23] arXiv:2403.01757 [ pdf , ps , html , other ]
-
Title: How Multimodal Integration Boost the Performance of LLM for Optimization: Case Study on Capacitated Vehicle Routing ProblemsComments: 8pages,3 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Abstract: Recently, large language models (LLMs) have notably positioned them as capable tools for addressing complex optimization challenges. Despite this recognition, a predominant limitation of existing LLM-based optimization methods is their struggle to capture the relationships among decision variables when relying exclusively on numerical text prompts, especially in high-dimensional problems. Keeping this in mind, we first propose to enhance the optimization performance using multimodal LLM capable of processing both textual and visual prompts for deeper insights of the processed optimization problem. This integration allows for a more comprehensive understanding of optimization problems, akin to human cognitive processes. We have developed a multimodal LLM-based optimization framework that simulates human problem-solving workflows, thereby offering a more nuanced and effective analysis. The efficacy of this method is evaluated through extensive empirical studies focused on a well-known combinatorial optimization problem, i.e., capacitated vehicle routing problem. The results are compared against those obtained from the LLM-based optimization algorithms that rely solely on textual prompts, demonstrating the significant advantages of our multimodal approach.
- [24] arXiv:2403.01784 [ pdf , ps , html , other ]
-
Title: CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and TextComments: 10 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI) ; Programming Languages (cs.PL)
Abstract: Large language models (LLMs) such as ChatGPT are increasingly proficient in understanding and generating a mixture of code and text. Evaluation based on such $\textit{mixture}$ can lead to a more comprehensive understanding of the models' abilities in solving coding problems. However, in this context, current evaluation methods are either limited in task coverage or lack standardization. To address this issue, we propose using category theory as a framework for evaluation. Specifically, morphisms within a code category can represent code debugging and transformation, functors between two categories represent code translation, and functors between a code category and a natural language category represent code generation, explanation, and reproduction. We present an automatic evaluation framework called $\textbf{CatCode}$ ($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.
- [25] arXiv:2403.01816 [ pdf , ps , html , other ]
-
Title: SMAUG: A Sliding Multidimensional Task Window-Based MARL Framework for Adaptive Real-Time Subtask RecognitionSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: Instead of making behavioral decisions directly from the exponentially expanding joint observational-action space, subtask-based multi-agent reinforcement learning (MARL) methods enable agents to learn how to tackle different subtasks. Most existing subtask-based MARL methods are based on hierarchical reinforcement learning (HRL). However, these approaches often limit the number of subtasks, perform subtask recognition periodically, and can only identify and execute a specific subtask within the predefined fixed time period, which makes them inflexible and not suitable for diverse and dynamic scenarios with constantly changing subtasks. To break through above restrictions, a \textbf{S}liding \textbf{M}ultidimensional t\textbf{A}sk window based m\textbf{U}ti-agent reinforcement learnin\textbf{G} framework (SMAUG) is proposed for adaptive real-time subtask recognition. It leverages a sliding multidimensional task window to extract essential information of subtasks from trajectory segments concatenated based on observed and predicted trajectories in varying lengths. An inference network is designed to iteratively predict future trajectories with the subtask-oriented policy network. Furthermore, intrinsic motivation rewards are defined to promote subtask exploration and behavior diversity. SMAUG can be integrated with any Q-learning-based approach. Experiments on StarCraft II show that SMAUG not only demonstrates performance superiority in comparison with all baselines but also presents a more prominent and swift rise in rewards during the initial training stage.
- [26] arXiv:2403.01832 [ pdf , ps , html , other ]
-
Title: Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial PragmatismComments: Accepted for Data-centric Machine Learning Research (DMLR) Workshop at ICLR 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.
- [27] arXiv:2403.01888 [ pdf , ps , html , other ]
-
Title: Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost BenchmarksComments: Submitted to AutoML Conference 2024 ABCD TrackSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: While deep learning has celebrated many successes, its results often hinge on the meticulous selection of hyperparameters (HPs). However, the time-consuming nature of deep learning training makes HP optimization (HPO) a costly endeavor, slowing down the development of efficient HPO tools. While zero-cost benchmarks, which provide performance and runtime without actual training, offer a solution for non-parallel setups, they fall short in parallel setups as each worker must communicate its queried runtime to return its evaluation in the exact order. This work addresses this challenge by introducing a user-friendly Python package that facilitates efficient parallel HPO with zero-cost benchmarks. Our approach calculates the exact return order based on the information stored in file system, eliminating the need for long waiting times and enabling much faster HPO evaluations. We first verify the correctness of our approach through extensive testing and the experiments with 6 popular HPO libraries show its applicability to diverse libraries and its ability to achieve over 1000x speedup compared to a traditional approach. Our package can be installed via pip install mfhpo-simulator.
- [28] arXiv:2403.02053 [ pdf , ps , other ]
-
Title: A Scoping Review of Energy-Efficient Driving Behaviors and Applied State-of-the-Art AI MethodsJournal-ref: Energies 2024, 17, 500Subjects: Artificial Intelligence (cs.AI)
Abstract: The transportation sector remains a major contributor to greenhouse gas emissions. The understanding of energy-efficient driving behaviors and utilization of energy-efficient driving strategies are essential to reduce vehicles' fuel consumption. However, there is no comprehensive investigation into energy-efficient driving behaviors and strategies. Furthermore, many state-of-the-art AI models have been applied for the analysis of eco-friendly driving styles, but no overview is available. To fill the gap, this paper conducts a thorough literature review on ecological driving behaviors and styles and analyzes the driving factors influencing energy consumption and state-of-the-art methodologies. With a thorough scoping review process, the methodological and related data are compared. The results show that the factors that impact driving behaviors can be summarized into eleven features including speed, acceleration, deceleration, pedal, and so on. This paper finds that supervised/unsupervised learning algorithms and reinforcement learning frameworks have been popularly used to model the vehicle's energy consumption with multi-dimensional data. Furthermore, the literature shows that the driving data are collected from either simulators or real-world experiments, and the real-world data are mainly stored and transmitted by meters, controller area networks, onboard data services, smartphones, and additional sensors installed in the vehicle. Based on driving behavior factors, driver characteristics, and safety rules, this paper recommends nine energy-efficient driving styles including four guidelines for the drivers' selection and adjustment of the vehicle parameters, three recommendations for the energy-efficient driving styles in different driving scenarios, and two subjective suggestions for different types of drivers and employers.
- [29] arXiv:2403.02054 [ pdf , ps , html , other ]
-
Title: Large Language Model-Based Evolutionary Optimizer: Reasoning with elitismShuvayan Brahmachary , Subodh M. Joshi , Aniruddha Panda , Kaushik Koneripalli , Arun Kumar Sagotra , Harshil Patel , Ankush Sharma , Ameya D. Jagtap , Kaushic KalyanaramanSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, prompting interest in their application as black-box optimizers. This paper asserts that LLMs possess the capability for zero-shot optimization across diverse scenarios, including multi-objective and high-dimensional problems. We introduce a novel population-based method for numerical optimization using LLMs called Language-Model-Based Evolutionary Optimizer (LEO). Our hypothesis is supported through numerical examples, spanning benchmark and industrial engineering problems such as supersonic nozzle shape optimization, heat transfer, and windfarm layout optimization. We compare our method to several gradient-based and gradient-free optimization approaches. While LLMs yield comparable results to state-of-the-art methods, their imaginative nature and propensity to hallucinate demand careful handling. We provide practical guidelines for obtaining reliable answers from LLMs and discuss method limitations and potential research directions.
- [30] arXiv:2403.02164 [ pdf , ps , other ]
-
Title: Cognition is All You Need -- The Next Layer of AI Above Large Language ModelsComments: 63 pages, 18 figuresSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: Recent studies of the applications of conversational AI tools, such as chatbots powered by large language models, to complex real-world knowledge work have shown limitations related to reasoning and multi-step problem solving. Specifically, while existing chatbots simulate shallow reasoning and understanding they are prone to errors as problem complexity increases. The failure of these systems to address complex knowledge work is due to the fact that they do not perform any actual cognition. In this position paper, we present Cognitive AI, a higher-level framework for implementing programmatically defined neuro-symbolic cognition above and outside of large language models. Specifically, we propose a dual-layer functional architecture for Cognitive AI that serves as a roadmap for AI systems that can perform complex multi-step knowledge work. We propose that Cognitive AI is a necessary precursor for the evolution of higher forms of AI, such as AGI, and specifically claim that AGI cannot be achieved by probabilistic approaches on their own. We conclude with a discussion of the implications for large language models, adoption cycles in AI, and commercial Cognitive AI development.
- [31] arXiv:2403.02290 [ pdf , ps , html , other ]
-
Title: Koopman-Assisted Reinforcement LearningComments: 35 pages, 12 figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Abstract: The Bellman equation and its continuous form, the Hamilton-Jacobi-Bellman (HJB) equation, are ubiquitous in reinforcement learning (RL) and control theory. However, these equations quickly become intractable for systems with high-dimensional states and nonlinearity. This paper explores the connection between the data-driven Koopman operator and Markov Decision Processes (MDPs), resulting in the development of two new RL algorithms to address these limitations. We leverage Koopman operator techniques to lift a nonlinear system into new coordinates where the dynamics become approximately linear, and where HJB-based methods are more tractable. In particular, the Koopman operator is able to capture the expectation of the time evolution of the value function of a given system via linear dynamics in the lifted coordinates. By parameterizing the Koopman operator with the control actions, we construct a ``Koopman tensor'' that facilitates the estimation of the optimal value function. Then, a transformation of Bellman's framework in terms of the Koopman tensor enables us to reformulate two max-entropy RL algorithms: soft value iteration and soft actor-critic (SAC). This highly flexible framework can be used for deterministic or stochastic systems as well as for discrete or continuous-time dynamics. Finally, we show that these Koopman Assisted Reinforcement Learning (KARL) algorithms attain state-of-the-art (SOTA) performance with respect to traditional neural network-based SAC and linear quadratic regulator (LQR) baselines on four controlled dynamical systems: a linear state-space system, the Lorenz system, fluid flow past a cylinder, and a double-well potential with non-isotropic stochastic forcing.
- [32] arXiv:2403.02454 [ pdf , ps , html , other ]
-
Title: The Ink Splotch Effect: A Case Study on ChatGPT as a Co-Creative Game DesignerComments: 12 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper studies how large language models (LLMs) can act as effective, high-level creative collaborators and ``muses'' for game design. We model the design of this study after the exercises artists use by looking at amorphous ink splotches for creative inspiration. Our goal is to determine whether AI-assistance can improve, hinder, or provide an alternative quality to games when compared to the creative intents implemented by human designers. The capabilities of LLMs as game designers are stress tested by placing it at the forefront of the decision making process. Three prototype games are designed across 3 different genres: (1) a minimalist base game, (2) a game with features and game feel elements added by a human game designer, and (3) a game with features and feel elements directly implemented from prompted outputs of the LLM, ChatGPT. A user study was conducted and participants were asked to blindly evaluate the quality and their preference of these games. We discuss both the development process of communicating creative intent to an AI chatbot and the synthesized open feedback of the participants. We use this data to determine both the benefits and shortcomings of AI in a more design-centric role.
- [33] arXiv:2403.02482 [ pdf , ps , html , other ]
-
Title: MORBDD: Multiobjective Restricted Binary Decision Diagrams by Learning to SparsifySubjects: Artificial Intelligence (cs.AI)
Abstract: In multicriteria decision-making, a user seeks a set of non-dominated solutions to a (constrained) multiobjective optimization problem, the so-called Pareto frontier. In this work, we seek to bring a state-of-the-art method for exact multiobjective integer linear programming into the heuristic realm. We focus on binary decision diagrams (BDDs) which first construct a graph that represents all feasible solutions to the problem and then traverse the graph to extract the Pareto frontier. Because the Pareto frontier may be exponentially large, enumerating it over the BDD can be time-consuming. We explore how restricted BDDs, which have already been shown to be effective as heuristics for single-objective problems, can be adapted to multiobjective optimization through the use of machine learning (ML). MORBDD, our ML-based BDD sparsifier, first trains a binary classifier to eliminate BDD nodes that are unlikely to contribute to Pareto solutions, then post-processes the sparse BDD to ensure its connectivity via optimization. Experimental results on multiobjective knapsack problems show that MORBDD is highly effective at producing very small restricted BDDs with excellent approximation quality, outperforming width-limited restricted BDDs and the well-known evolutionary algorithm NSGA-II.
- [34] arXiv:2403.02523 [ pdf , ps , html , other ]
-
Title: Transformer for Times Series: an Application to the S&P500Subjects: Artificial Intelligence (cs.AI) ; Portfolio Management (q-fin.PM); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
Abstract: The transformer models have been extensively used with good results in a wide area of machine learning applications including Large Language Models and image generation. Here, we inquire on the applicability of this approach to financial time series. We first describe the dataset construction for two prototypical situations: a mean reverting synthetic Ornstein-Uhlenbeck process on one hand and real S&P500 data on the other hand. Then, we present in detail the proposed Transformer architecture and finally we discuss some encouraging results. For the synthetic data we predict rather accurately the next move, and for the S&P500 we get some interesting results related to quadratic variation and volatility prediction.
- [35] arXiv:2403.02610 [ pdf , ps , html , other ]
-
Title: ChatGPT4PCG 2 Competition: Prompt Engineering for Science Birds Level GenerationPittawat Taveekitworachai , Febri Abdullah , Mury F. Dewantoro , Yi Xia , Pratch Suntichaikul , Ruck Thawonmas , Julian Togelius , Jochen RenzSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper presents the second ChatGPT4PCG competition at the 2024 IEEE Conference on Games. In this edition of the competition, we follow the first edition, but make several improvements and changes. We introduce a new evaluation metric along with allowing a more flexible format for participants' submissions and making several improvements to the evaluation pipeline. Continuing from the first edition, we aim to foster and explore the realm of prompt engineering (PE) for procedural content generation (PCG). While the first competition saw success, it was hindered by various limitations; we aim to mitigate these limitations in this edition. We introduce diversity as a new metric to discourage submissions aimed at producing repetitive structures. Furthermore, we allow submission of a Python program instead of a prompt text file for greater flexibility in implementing advanced PE approaches, which may require control flow, including conditions and iterations. We also make several improvements to the evaluation pipeline with a better classifier for similarity evaluation and better-performing function signatures. We thoroughly evaluate the effectiveness of the new metric and the improved classifier. Additionally, we perform an ablation study to select a function signature to instruct ChatGPT for level generation. Finally, we provide implementation examples of various PE techniques in Python and evaluate their preliminary performance. We hope this competition serves as a resource and platform for learning about PE and PCG in general.
- [36] arXiv:2403.02635 [ pdf , ps , html , other ]
-
Title: PPS-QMIX: Periodically Parameter Sharing for Accelerating Convergence of Multi-Agent Reinforcement LearningComments: 10 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Training for multi-agent reinforcement learning(MARL) is a time-consuming process caused by distribution shift of each agent. One drawback is that strategy of each agent in MARL is independent but actually in cooperation. Thus, a vertical issue in multi-agent reinforcement learning is how to efficiently accelerate training process. To address this problem, current research has leveraged a centralized function(CF) across multiple agents to learn contribution of the team reward for each agent. However, CF based methods introduce joint error from other agents in estimation of value network. In so doing, inspired by federated learning, we propose three simple novel approaches called Average Periodically Parameter Sharing(A-PPS), Reward-Scalability Periodically Parameter Sharing(RS-PPS) and Partial Personalized Periodically Parameter Sharing(PP-PPS) mechanism to accelerate training of MARL. Agents share Q-value network periodically during the training process. Agents which has same identity adapt collected reward as scalability and update partial neural network during period to share different parameters. We apply our approaches in classical MARL method QMIX and evaluate our approaches on various tasks in StarCraft Multi-Agent Challenge(SMAC) environment. Performance of numerical experiments yield enormous enhancement, with an average improvement of 10\%-30\%, and enable to win tasks that QMIX cannot. Our code can be downloaded from this https URL
- [37] arXiv:2403.02719 [ pdf , ps , html , other ]
-
Title: Multi-Scale Subgraph Contrastive LearningComments: The 32nd International Joint Conference on Artificial Intelligence (IJCAI-2023)Subjects: Artificial Intelligence (cs.AI)
Abstract: Graph-level contrastive learning, aiming to learn the representations for each graph by contrasting two augmented graphs, has attracted considerable attention. Previous studies usually simply assume that a graph and its augmented graph as a positive pair, otherwise as a negative pair. However, it is well known that graph structure is always complex and multi-scale, which gives rise to a fundamental question: after graph augmentation, will the previous assumption still hold in reality? By an experimental analysis, we discover the semantic information of an augmented graph structure may be not consistent as original graph structure, and whether two augmented graphs are positive or negative pairs is highly related with the multi-scale structures. Based on this finding, we propose a multi-scale subgraph contrastive learning architecture which is able to characterize the fine-grained semantic information. Specifically, we generate global and local views at different scales based on subgraph sampling, and construct multiple contrastive relationships according to their semantic associations to provide richer self-supervised signals. Extensive experiments and parametric analyzes on eight graph classification real-world datasets well demonstrate the effectiveness of the proposed method.
- [38] arXiv:2403.02723 [ pdf , ps , html , other ]
-
Title: Minimum Topology Attacks for Graph Neural NetworksComments: Published on WWW 2023. Proceedings of the ACM Web Conference 2023Subjects: Artificial Intelligence (cs.AI)
Abstract: With the great popularity of Graph Neural Networks (GNNs), their robustness to adversarial topology attacks has received significant attention. Although many attack methods have been proposed, they mainly focus on fixed-budget attacks, aiming at finding the most adversarial perturbations within a fixed budget for target node. However, considering the varied robustness of each node, there is an inevitable dilemma caused by the fixed budget, i.e., no successful perturbation is found when the budget is relatively small, while if it is too large, the yielding redundant perturbations will hurt the invisibility. To break this dilemma, we propose a new type of topology attack, named minimum-budget topology attack, aiming to adaptively find the minimum perturbation sufficient for a successful attack on each node. To this end, we propose an attack model, named MiBTack, based on a dynamic projected gradient descent algorithm, which can effectively solve the involving non-convex constraint optimization on discrete topology. Extensive results on three GNNs and four real-world datasets show that MiBTack can successfully lead all target nodes misclassified with the minimum perturbation edges. Moreover, the obtained minimum budget can be used to measure node robustness, so we can explore the relationships of robustness, topology, and uncertainty for nodes, which is beyond what the current fixed-budget topology attacks can offer.
- [39] arXiv:2403.02745 [ pdf , ps , html , other ]
-
Title: CURATRON: Complete Robust Preference Data for Robust Alignment of Large Language ModelsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL), with a focus on the issues of incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets to enhance LLMs resilience against the issues. In particular, we devise a guaranteed polynomial time ranking algorithm that robustifies several existing models, such as the classic Bradley--Terry--Luce (BTL) (Bradley and Terry, 1952) model and certain generalizations of it. To the best of our knowledge, our present work is the first to propose an algorithm that provably recovers an {\epsilon}-optimal ranking with high probability while allowing as large as O(n) perturbed pairwise comparison results per model response. Furthermore, we show robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well in both general and LLM preference dataset settings. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline with the ability to handle missing and maliciously manipulated inputs.
- [40] arXiv:2403.02760 [ pdf , ps , other ]
-
Title: Emerging Synergies Between Large Language Models and Machine Learning in Ecommerce RecommendationsSubjects: Artificial Intelligence (cs.AI)
Abstract: With the boom of e-commerce and web applications, recommender systems have become an important part of our daily lives, providing personalized recommendations based on the user's preferences. Although deep neural networks (DNNs) have made significant progress in improving recommendation systems by simulating the interaction between users and items and incorporating their textual information, these DNN-based approaches still have some limitations, such as the difficulty of effectively understanding users' interests and capturing textual information. It is not possible to generalize to different seen/unseen recommendation scenarios and reason about their predictions. At the same time, the emergence of large language models (LLMs), represented by ChatGPT and GPT-4, has revolutionized the fields of natural language processing (NLP) and artificial intelligence (AI) due to their superior capabilities in the basic tasks of language understanding and generation, and their impressive generalization and reasoning capabilities. As a result, recent research has sought to harness the power of LLM to improve recommendation systems. Given the rapid development of this research direction in the field of recommendation systems, there is an urgent need for a systematic review of existing LLM-driven recommendation systems for researchers and practitioners in related fields to gain insight into. More specifically, we first introduced a representative approach to learning user and item representations using LLM as a feature encoder. We then reviewed the latest advances in LLMs techniques for collaborative filtering enhanced recommendation systems from the three paradigms of pre-training, fine-tuning, and prompting. Finally, we had a comprehensive discussion on the future direction of this emerging field.
- [41] arXiv:2403.02775 [ pdf , ps , html , other ]
-
Title: EasyQuant: An Efficient Data-free Quantization Algorithm for LLMsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have proven to be very superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks. Hence in this work, we explore an important question: Can we design a data-independent quantization method for LLMs to guarantee its generalization performance? In this work, we propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for LLMs. Our observation indicates that two factors: outliers in the weight and quantization ranges, are essential for reducing the quantization error. Therefore, in EasyQuant, we leave the outliers (less than 1%) unchanged and optimize the quantization range to reduce the reconstruction error. With these methods, we surprisingly find that EasyQuant achieves comparable performance to the original model. Since EasyQuant does not depend on any training data, the generalization performance of quantized LLMs is safely guaranteed. Moreover, EasyQuant can be implemented in parallel so that the quantized model could be attained in a few minutes even for LLMs over 100B. To our best knowledge, we are the first work that achieves almost lossless quantization performance for LLMs under a data-independent setting and our algorithm runs over 10 times faster than the data-dependent methods.
- [42] arXiv:2403.02783 [ pdf , ps , html , other ]
-
Title: Where the Really Hard Quadratic Assignment Problems Are: the QAP-SAT instancesJournal-ref: Evolutionary Computation in Combinatorial Optimization Conference (evoCOP), Apr 2024, Aberystwyth, United KingdomSubjects: Artificial Intelligence (cs.AI)
Abstract: The Quadratic Assignment Problem (QAP) is one of the major domains in the field of evolutionary computation, and more widely in combinatorial optimization. This paper studies the phase transition of the QAP, which can be described as a dramatic change in the problem's computational complexity and satisfiability, within a narrow range of the problem parameters. To approach this phenomenon, we introduce a new QAP-SAT design of the initial problem based on submodularity to capture its difficulty with new features. This decomposition is studied experimentally using branch-and-bound and tabu search solvers. A phase transition parameter is then proposed. The critical parameter of phase transition satisfaction and that of the solving effort are shown to be highly correlated for tabu search, thus allowing the prediction of difficult instances.
- [43] arXiv:2403.02795 [ pdf , ps , other ]
-
Title: Evaluating and Optimizing Educational Content with Large Language Model JudgmentsComments: 11 pagesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Creating effective educational materials generally requires expensive and time-consuming studies of student learning outcomes. To overcome this barrier, one idea is to build computational models of student learning and use them to optimize instructional materials. However, it is difficult to model the cognitive processes of learning dynamics. We propose an alternative approach that uses Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes. Specifically, we use GPT-3.5 to evaluate the overall effect of instructional materials on different student groups and find that it can replicate well-established educational findings such as the Expertise Reversal Effect and the Variability Effect. This demonstrates the potential of LMs as reliable evaluators of educational content. Building on this insight, we introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function. We apply this approach to create math word problem worksheets aimed at maximizing student learning gains. Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences. We conclude by discussing potential divergences between human and LM opinions and the resulting pitfalls of automating instructional design.
- [44] arXiv:2403.02820 [ pdf , ps , html , other ]
-
Title: Reconstruction for Sparse View Tomography of Long Objects Applied to Imaging in the Wood IndustrySubjects: Artificial Intelligence (cs.AI)
Abstract: In the wood industry, logs are commonly quality screened by discrete X-ray scans on a moving conveyor belt from a few source positions. Typically, two-dimensional (2D) slice-wise measurements are obtained by a sequential scanning geometry. Each 2D slice alone does not carry sufficient information for a three-dimensional tomographic reconstruction in which biological features of interest in the log are well preserved. In the present work, we propose a learned iterative reconstruction method based on the Learned Primal-Dual neural network, suited for sequential scanning geometries. Our method accumulates information between neighbouring slices, instead of only accounting for single slices during reconstruction. Our quantitative and qualitative evaluations with as few as five source positions show that our method yields reconstructions of logs that are sufficiently accurate to identify biological features like knots (branches), heartwood and sapwood.
- [45] arXiv:2403.02870 [ pdf , ps , html , other ]
-
Title: Precise Extraction of Deep Learning Models via Side-Channel Attacks on Edge/Endpoint DevicesComments: Accepted by 27th European Symposium on Research in Computer Security (ESORICS 2022)Subjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: With growing popularity, deep learning (DL) models are becoming larger-scale, and only the companies with vast training datasets and immense computing power can manage their business serving such large models. Most of those DL models are proprietary to the companies who thus strive to keep their private models safe from the model extraction attack (MEA), whose aim is to steal the model by training surrogate models. Nowadays, companies are inclined to offload the models from central servers to edge/endpoint devices. As revealed in the latest studies, adversaries exploit this opportunity as new attack vectors to launch side-channel attack (SCA) on the device running victim model and obtain various pieces of the model information, such as the model architecture (MA) and image dimension (ID). Our work provides a comprehensive understanding of such a relationship for the first time and would benefit future MEA studies in both offensive and defensive sides in that they may learn which pieces of information exposed by SCA are more important than the others. Our analysis additionally reveals that by grasping the victim model information from SCA, MEA can get highly effective and successful even without any prior knowledge of the model. Finally, to evince the practicality of our analysis results, we empirically apply SCA, and subsequently, carry out MEA under realistic threat assumptions. The results show up to 5.8 times better performance than when the adversary has no model information about the victim model.
- [46] arXiv:2403.02899 [ pdf , ps , html , other ]
-
Title: Domain-Agnostic Mutual Prompting for Unsupervised Domain AdaptationSubjects: Artificial Intelligence (cs.AI)
Abstract: Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains, which neglects to harness rich semantics from data and struggles to handle complex domain shifts. A promising technique is to leverage the knowledge of large-scale pre-trained vision-language models for more guided adaptation. Despite some endeavors, current methods often learn textual prompts to embed domain semantics for source and target domains separately and perform classification within each domain, limiting cross-domain knowledge transfer. Moreover, prompting only the language branch lacks flexibility to adapt both modalities dynamically. To bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics by mutually aligning visual and textual embeddings. Specifically, the image contextual information is utilized to prompt the language branch in a domain-agnostic and instance-conditioned way. Meanwhile, visual prompts are imposed based on the domain-agnostic textual prompt to elicit domain-invariant visual embeddings. These two branches of prompts are learned mutually with a cross-attention module and regularized with a semantic-consistency loss and an instance-discrimination contrastive loss. Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches.
- [47] arXiv:2403.02901 [ pdf , ps , html , other ]
-
Title: A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based MethodsSubjects: Artificial Intelligence (cs.AI)
Abstract: Automatic Text Summarization (ATS), utilizing Natural Language Processing (NLP) algorithms, aims to create concise and accurate summaries, thereby significantly reducing the human effort required in processing large volumes of text. ATS has drawn considerable interest in both academic and industrial circles. Many studies have been conducted in the past to survey ATS methods; however, they generally lack practicality for real-world implementations, as they often categorize previous methods from a theoretical standpoint. Moreover, the advent of Large Language Models (LLMs) has altered conventional ATS methods. In this survey, we aim to 1) provide a comprehensive overview of ATS from a ``Process-Oriented Schema'' perspective, which is best aligned with real-world implementations; 2) comprehensively review the latest LLM-based ATS works; and 3) deliver an up-to-date survey of ATS, bridging the two-year gap in the literature. To the best of our knowledge, this is the first survey to specifically investigate LLM-based ATS methods.
- [48] arXiv:2403.02914 [ pdf , ps , html , other ]
-
Title: DynST: Dynamic Sparse Training for Resource-Constrained Spatio-Temporal ForecastingSubjects: Artificial Intelligence (cs.AI)
Abstract: The ever-increasing sensor service, though opening a precious path and providing a deluge of earth system data for deep-learning-oriented earth science, sadly introduce a daunting obstacle to their industrial level deployment. Concretely, earth science systems rely heavily on the extensive deployment of sensors, however, the data collection from sensors is constrained by complex geographical and social factors, making it challenging to achieve comprehensive coverage and uniform deployment. To alleviate the obstacle, traditional approaches to sensor deployment utilize specific algorithms to design and deploy sensors. These methods dynamically adjust the activation times of sensors to optimize the detection process across each sub-region. Regrettably, formulating an activation strategy generally based on historical observations and geographic characteristics, which make the methods and resultant models were neither simple nor practical. Worse still, the complex technical design may ultimately lead to a model with weak generalizability. In this paper, we introduce for the first time the concept of spatio-temporal data dynamic sparse training and are committed to adaptively, dynamically filtering important sensor distributions. To our knowledge, this is the first proposal (termed DynST) of an industry-level deployment optimization concept at the data level. However, due to the existence of the temporal dimension, pruning of spatio-temporal data may lead to conflicts at different timestamps. To achieve this goal, we employ dynamic merge technology, along with ingenious dimensional mapping to mitigate potential impacts caused by the temporal aspect. During the training process, DynST utilize iterative pruning and sparse training, repeatedly identifying and dynamically removing sensor perception areas that contribute the least to future predictions.
- [49] arXiv:2403.02933 [ pdf , ps , html , other ]
-
Title: Fuzzy Datalog$^\exists$ over Arbitrary t-NormsSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: One of the main challenges in the area of Neuro-Symbolic AI is to perform logical reasoning in the presence of both neural and symbolic data. This requires combining heterogeneous data sources such as knowledge graphs, neural model predictions, structured databases, crowd-sourced data, and many more. To allow for such reasoning, we generalise the standard rule-based language Datalog with existential rules (commonly referred to as tuple-generating dependencies) to the fuzzy setting, by allowing for arbitrary t-norms in the place of classical conjunctions in rule bodies. The resulting formalism allows us to perform reasoning about data associated with degrees of uncertainty while preserving computational complexity results and the applicability of reasoning techniques established for the standard Datalog setting. In particular, we provide fuzzy extensions of Datalog chases which produce fuzzy universal models and we exploit them to show that in important fragments of the language, reasoning has the same complexity as in the classical setting.
- [50] arXiv:2403.02936 [ pdf , ps , html , other ]
-
Title: AdAM: Adaptive Fault-Tolerant Approximate Multiplier for Edge DNN AcceleratorsMahdi Taheri , Natalia Cherezova , Samira Nazari , Ahsan Rafiq , Ali Azarpeyvand , Tara Ghasempouri , Masoud Daneshtalab , Jaan Raik , Maksim JenihhinSubjects: Artificial Intelligence (cs.AI) ; Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Abstract: In this paper, we propose an architecture of a novel adaptive fault-tolerant approximate multiplier tailored for ASIC-based DNN accelerators.
- [51] arXiv:2403.02946 [ pdf , ps , html , other ]
-
Title: SAFFIRA: a Framework for Assessing the Reliability of Systolic-Array-Based DNN AcceleratorsMahdi Taheri , Masoud Daneshtalab , Jaan Raik , Maksim Jenihhin , Salvatore Pappalardo , Paul Jimenez , Bastien Deveautour , Alberto BosioSubjects: Artificial Intelligence (cs.AI) ; Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Abstract: Systolic array has emerged as a prominent architecture for Deep Neural Network (DNN) hardware accelerators, providing high-throughput and low-latency performance essential for deploying DNNs across diverse applications. However, when used in safety-critical applications, reliability assessment is mandatory to guarantee the correct behavior of DNN accelerators. While fault injection stands out as a well-established practical and robust method for reliability assessment, it is still a very time-consuming process. This paper addresses the time efficiency issue by introducing a novel hierarchical software-based hardware-aware fault injection strategy tailored for systolic array-based DNN accelerators.
- [52] arXiv:2403.02950 [ pdf , ps , html , other ]
-
Title: A general approach to enhance the survivability of backdoor attacks by decision path couplingSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR)
Abstract: Backdoor attacks have been one of the emerging security threats to deep neural networks (DNNs), leading to serious consequences. One of the mainstream backdoor defenses is model reconstruction-based. Such defenses adopt model unlearning or pruning to eliminate backdoors. However, little attention has been paid to survive from such defenses. To bridge the gap, we propose Venom, the first generic backdoor attack enhancer to improve the survivability of existing backdoor attacks against model reconstruction-based defenses. We formalize Venom as a binary-task optimization problem. The first is the original backdoor attack task to preserve the original attack capability, while the second is the attack enhancement task to improve the attack survivability. To realize the second task, we propose attention imitation loss to force the decision path of poisoned samples in backdoored models to couple with the crucial decision path of benign samples, which makes backdoors difficult to eliminate. Our extensive evaluation on two DNNs and three datasets has demonstrated that Venom significantly improves the survivability of eight state-of-the-art attacks against eight state-of-the-art defenses without impacting the capability of the original attacks.
- [53] arXiv:2403.02962 [ pdf , ps , html , other ]
-
Title: WikiTableEdit: A Benchmark for Table Editing by Natural Language InstructionSubjects: Artificial Intelligence (cs.AI)
Abstract: Tabular data, as a crucial form of data representation, exists in diverse formats on the Web. When confronted with complex and irregular tables, manual modification becomes a laborious task. This paper investigates the performance of Large Language Models (LLMs) in the context of table editing tasks. Existing research mainly focuses on regular-shaped tables, wherein instructions are used to generate code in SQL, Python, or Excel Office-script for manipulating the tables. Nevertheless, editing tables with irregular structures, particularly those containing merged cells spanning multiple rows, poses a challenge when using code. To address this, we introduce the WikiTableEdit dataset. Leveraging 26,531 tables from the WikiSQL dataset, we automatically generate natural language instructions for six distinct basic operations and the corresponding outcomes, resulting in over 200,000 instances. Subsequently, we evaluate several representative large language models on the WikiTableEdit dataset to demonstrate the challenge of this task. The dataset will be released to the community to promote related researches.
- [54] arXiv:2403.02985 [ pdf , ps , html , other ]
-
Title: Evolution Transformer: In-Context Evolutionary OptimizationSubjects: Artificial Intelligence (cs.AI) ; Neural and Evolutionary Computing (cs.NE)
Abstract: Evolutionary optimization algorithms are often derived from loose biological analogies and struggle to leverage information obtained during the sequential course of optimization. An alternative promising approach is to leverage data and directly discover powerful optimization principles via meta-optimization. In this work, we follow such a paradigm and introduce Evolution Transformer, a causal Transformer architecture, which can flexibly characterize a family of Evolution Strategies. Given a trajectory of evaluations and search distribution statistics, Evolution Transformer outputs a performance-improving update to the search distribution. The architecture imposes a set of suitable inductive biases, i.e. the invariance of the distribution update to the order of population members within a generation and equivariance to the order of the search dimensions. We train the model weights using Evolutionary Algorithm Distillation, a technique for supervised optimization of sequence models using teacher algorithm trajectories. The resulting model exhibits strong in-context optimization performance and shows strong generalization capabilities to otherwise challenging neuroevolution tasks. We analyze the resulting properties of the Evolution Transformer and propose a technique to fully self-referentially train the Evolution Transformer, starting from a random initialization and bootstrapping its own learning progress. We provide an open source implementation under this https URL .
- [55] arXiv:2403.02993 [ pdf , ps , html , other ]
-
Title: Localized Zeroth-Order Prompt OptimizationWenyang Hu , Yao Shu , Zongmin Yu , Zhaoxuan Wu , Xiangqiang Lin , Zhongxiang Dai , See-Kiong Ng , Bryan Kian Hsiang LowSubjects: Artificial Intelligence (cs.AI)
Abstract: The efficacy of large language models (LLMs) in understanding and generating natural language has aroused a wide interest in developing prompt-based methods to harness the power of black-box LLMs. Existing methodologies usually prioritize a global optimization for finding the global optimum, which however will perform poorly in certain tasks. This thus motivates us to re-think the necessity of finding a global optimum in prompt optimization. To answer this, we conduct a thorough empirical study on prompt optimization and draw two major insights. Contrasting with the rarity of global optimum, local optima are usually prevalent and well-performed, which can be more worthwhile for efficient prompt optimization (Insight I). The choice of the input domain, covering both the generation and the representation of prompts, affects the identification of well-performing local optima (Insight II). Inspired by these insights, we propose a novel algorithm, namely localized zeroth-order prompt optimization (ZOPO), which incorporates a Neural Tangent Kernel-based derived Gaussian process into standard zeroth-order optimization for an efficient search of well-performing local optima in prompt optimization. Remarkably, ZOPO outperforms existing baselines in terms of both the optimization performance and the query efficiency, which we demonstrate through extensive experiments.
- [56] arXiv:2403.03008 [ pdf , ps , html , other ]
-
Title: Knowledge Graphs as Context Sources for LLM-Based Explanations of Learning RecommendationsSubjects: Artificial Intelligence (cs.AI)
Abstract: In the era of personalized education, the provision of comprehensible explanations for learning recommendations is of a great value to enhance the learner's understanding and engagement with the recommended learning content. Large language models (LLMs) and generative AI in general have recently opened new doors for generating human-like explanations, for and along learning recommendations. However, their precision is still far away from acceptable in a sensitive field like education. To harness the abilities of LLMs, while still ensuring a high level of precision towards the intent of the learners, this paper proposes an approach to utilize knowledge graphs (KG) as a source of factual context, for LLM prompts, reducing the risk of model hallucinations, and safeguarding against wrong or imprecise information, while maintaining an application-intended learning context. We utilize the semantic relations in the knowledge graph to offer curated knowledge about learning recommendations. With domain-experts in the loop, we design the explanation as a textual template, which is filled and completed by the LLM. Domain experts were integrated in the prompt engineering phase as part of a study, to ensure that explanations include information that is relevant to the learner. We evaluate our approach quantitatively using Rouge-N and Rouge-L measures, as well as qualitatively with experts and learners. Our results show an enhanced recall and precision of the generated explanations compared to those generated solely by the GPT model, with a greatly reduced risk of generating imprecise information in the final learning explanation.
- [57] arXiv:2403.03017 [ pdf , ps , html , other ]
-
Title: OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction FollowingSubjects: Artificial Intelligence (cs.AI)
Abstract: Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, there exists a lack of a unified understanding regarding the impact of various components-ranging from visual perception to action execution-on task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
- [58] arXiv:2403.03028 [ pdf , ps , html , other ]
-
Title: Word Importance Explains How Prompts Affect Language Model OutputsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: The emergence of large language models (LLMs) has revolutionized numerous applications across industries. However, their "black box" nature often hinders the understanding of how they make specific decisions, raising concerns about their transparency, reliability, and ethical use. This study presents a method to improve the explainability of LLMs by varying individual words in prompts to uncover their statistical impact on the model outputs. This approach, inspired by permutation importance for tabular data, masks each word in the system prompt and evaluates its effect on the outputs based on the available text scores aggregated over multiple user inputs. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores, which enables decomposing the importance of words into the specific measures of interest--including bias, reading level, verbosity, etc. This procedure also enables measuring impact when attention weights are not available. To test the fidelity of this approach, we explore the effect of adding different suffixes to multiple different system prompts and comparing subsequent generations with different large language models. Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.
- [59] arXiv:2403.03165 [ pdf , ps , other ]
-
Title: Leveraging Federated Learning and Edge Computing for Recommendation Systems within Cloud Computing NetworksSubjects: Artificial Intelligence (cs.AI)
Abstract: To enable large-scale and efficient deployment of artificial intelligence (AI), the combination of AI and edge computing has spawned Edge Intelligence, which leverages the computing and communication capabilities of end devices and edge servers to process data closer to where it is generated. A key technology for edge intelligence is the privacy-protecting machine learning paradigm known as Federated Learning (FL), which enables data owners to train models without having to transfer raw data to third-party servers. However, FL networks are expected to involve thousands of heterogeneous distributed devices. As a result, communication efficiency remains a key bottleneck. To reduce node failures and device exits, a Hierarchical Federated Learning (HFL) framework is proposed, where a designated cluster leader supports the data owner through intermediate model aggregation. Therefore, based on the improvement of edge server resource utilization, this paper can effectively make up for the limitation of cache capacity. In order to mitigate the impact of soft clicks on the quality of user experience (QoE), the authors model the user QoE as a comprehensive system cost. To solve the formulaic problem, the authors propose a decentralized caching algorithm with federated deep reinforcement learning (DRL) and federated learning (FL), where multiple agents learn and make decisions independently
- [60] arXiv:2403.03172 [ pdf , ps , html , other ]
-
Title: Reaching Consensus in Cooperative Multi-Agent Reinforcement Learning with Goal ImaginationLiangzhou Wang , Kaiwen Zhu , Fengming Zhu , Xinghu Yao , Shujie Zhang , Deheng Ye , Haobo Fu , Qiang Fu , Wei YangSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Reaching consensus is key to multi-agent coordination. To accomplish a cooperative task, agents need to coherently select optimal joint actions to maximize the team reward. However, current cooperative multi-agent reinforcement learning (MARL) methods usually do not explicitly take consensus into consideration, which may cause miscoordination problem. In this paper, we propose a model-based consensus mechanism to explicitly coordinate multiple agents. The proposed Multi-agent Goal Imagination (MAGI) framework guides agents to reach consensus with an Imagined common goal. The common goal is an achievable state with high value, which is obtained by sampling from the distribution of future states. We directly model this distribution with a self-supervised generative model, thus alleviating the "curse of dimensinality" problem induced by multi-agent multi-step policy rollout commonly used in model-based methods. We show that such efficient consensus mechanism can guide all agents cooperatively reaching valuable future states. Results on Multi-agent Particle-Environments and Google Research Football environment demonstrate the superiority of MAGI in both sample efficiency and performance.
- [61] arXiv:2403.03176 [ pdf , ps , html , other ]
-
Title: Unifying and Certifying Top-Quality PlanningComments: To appear at ICAPS 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: The growing utilization of planning tools in practical scenarios has sparked an interest in generating multiple high-quality plans. Consequently, a range of computational problems under the general umbrella of top-quality planning were introduced over a short time period, each with its own definition. In this work, we show that the existing definitions can be unified into one, based on a dominance relation. The different computational problems, therefore, simply correspond to different dominance relations. Given the unified definition, we can now certify the top-quality of the solutions, leveraging existing certification of unsolvability and optimality. We show that task transformations found in the existing literature can be employed for the efficient certification of various top-quality planning problems and propose a novel transformation to efficiently certify loopless top-quality planning.
- [62] arXiv:2403.03186 [ pdf , ps , html , other ]
-
Title: Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case StudyWeihao Tan , Ziluo Ding , Wentao Zhang , Boyu Li , Bohan Zhou , Junpeng Yue , Haochong Xia , Jiechuan Jiang , Longtao Zheng , Xinrun Xu , Yifei Bi , Pengjie Gu , Xinrun Wang , Börje F. Karlsson , Bo An , Zongqing LuSubjects: Artificial Intelligence (cs.AI)
Abstract: Despite the success in specific tasks and scenarios, existing foundation agents, empowered by large models (LMs) and advanced tools, still cannot generalize to different scenarios, mainly due to dramatic differences in the observations and actions across scenarios. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input, and producing keyboard and mouse operations as output, similar to human-computer interaction. The main challenges of achieving GCC are: 1) the multimodal observations for decision-making, 2) the requirements of accurate control of keyboard and mouse, 3) the need for long-term memory and reasoning, and 4) the abilities of efficient exploration and self-improvement. To target GCC, we introduce Cradle, an agent framework with six main modules, including: 1) information gathering to extract multi-modality information, 2) self-reflection to rethink past experiences, 3) task inference to choose the best next task, 4) skill curation for generating and updating relevant skills for given tasks, 5) action planning to generate specific operations for keyboard and mouse control, and 6) memory for storage and retrieval of past experiences and known skills. To demonstrate the capabilities of generalization and self-improvement of Cradle, we deploy it in the complex AAA game Red Dead Redemption II, serving as a preliminary attempt towards GCC with a challenging target. To our best knowledge, our work is the first to enable LMM-based agents to follow the main storyline and finish real missions in complex AAA games, with minimal reliance on prior knowledge or resources. The project website is at this https URL .
- [63] arXiv:2403.03188 [ pdf , ps , other ]
-
Title: Towards Democratized Flood Risk Management: An Advanced AI Assistant Enabled by GPT-4 for Enhanced Interpretability and Public EngagementRafaela Martelo , Ruo-Qian Wang (Rutgers University)Comments: 48 pages, 3 figures and an appendix with 2 supplementary tables detailing experimental results and observations. Supported by Rutgers's Research Incubator in Climate and Health, Seed Funding Initiative and Research Council Award - "Engaged Climate Action". Source code and data available at this https URLSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Abstract: Real-time flood forecasting plays a crucial role in enabling timely and effective emergency responses. However, a significant challenge lies in bridging the gap between complex numerical flood models and practical decision-making. Decision-makers often rely on experts to interpret these models for optimizing flood mitigation strategies. And the public requires complex techniques to inquiry and understand socio-cultural and institutional factors, often hinders the public's understanding of flood risks. To overcome these challenges, our study introduces an innovative solution: a customized AI Assistant powered by the GPT-4 Large Language Model. This AI Assistant is designed to facilitate effective communication between decision-makers, the general public, and flood forecasters, without the requirement of specialized knowledge. The new framework utilizes GPT-4's advanced natural language understanding and function calling capabilities to provide immediate flood alerts and respond to various flood-related inquiries. Our developed prototype integrates real-time flood warnings with flood maps and social vulnerability data. It also effectively translates complex flood zone information into actionable risk management advice. To assess its performance, we evaluated the prototype using six criteria within three main categories: relevance, error resilience, and understanding of context. Our research marks a significant step towards a more accessible and user-friendly approach in flood risk management. This study highlights the potential of advanced AI tools like GPT-4 in democratizing information and enhancing public engagement in critical social and environmental issues.
- [64] arXiv:2403.03203 [ pdf , ps , html , other ]
-
Title: CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable EnvironmentsComments: 17 pages, 10 images, Accepted at LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationSubjects: Artificial Intelligence (cs.AI)
Abstract: The integration of learning and reasoning is high on the research agenda in AI. Nevertheless, there is only a little attention to use existing background knowledge for reasoning about partially observed scenes to answer questions about the scene. Yet, we as humans use such knowledge frequently to infer plausible answers to visual questions (by eliminating all inconsistent ones). Such knowledge often comes in the form of constraints about objects and it tends to be highly domain or environment-specific. We contribute a novel benchmark called CLEVR-POC for reasoning-intensive visual question answering (VQA) in partially observable environments under constraints. In CLEVR-POC, knowledge in the form of logical constraints needs to be leveraged to generate plausible answers to questions about a hidden object in a given partial scene. For instance, if one has the knowledge that all cups are colored either red, green or blue and that there is only one green cup, it becomes possible to deduce the color of an occluded cup as either red or blue, provided that all other cups, including the green one, are observed. Through experiments, we observe that the low performance of pre-trained vision language models like CLIP (~ 22%) and a large language model (LLM) like GPT-4 (~ 46%) on CLEVR-POC ascertains the necessity for frameworks that can handle reasoning-intensive tasks where environment-specific background knowledge is available and crucial. Furthermore, our demonstration illustrates that a neuro-symbolic model, which integrates an LLM like GPT-4 with a visual perception network and a formal logical reasoner, exhibits exceptional performance on CLEVR-POC.
- [65] arXiv:2403.03288 [ pdf , ps , other ]
-
Title: Should We Fear Large Language Models? A Structural Analysis of the Human Reasoning System for Elucidating LLM Capabilities and Risks Through the Lens of Heidegger's PhilosophyComments: 39 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: In the rapidly evolving field of Large Language Models (LLMs), there is a critical need to thoroughly analyze their capabilities and risks. Central to our investigation are two novel elements. Firstly, it is the innovative parallels between the statistical patterns of word relationships within LLMs and Martin Heidegger's concepts of "ready-to-hand" and "present-at-hand," which encapsulate the utilitarian and scientific altitudes humans employ in interacting with the world. This comparison lays the groundwork for positioning LLMs as the digital counterpart to the Faculty of Verbal Knowledge, shedding light on their capacity to emulate certain facets of human reasoning. Secondly, a structural analysis of human reasoning, viewed through Heidegger's notion of truth as "unconcealment" is conducted This foundational principle enables us to map out the inputs and outputs of the reasoning system and divide reasoning into four distinct categories. Respective cognitive faculties are delineated, allowing us to place LLMs within the broader schema of human reasoning, thus clarifying their strengths and inherent limitations. Our findings reveal that while LLMs possess the capability for Direct Explicative Reasoning and Pseudo Rational Reasoning, they fall short in authentic rational reasoning and have no creative reasoning capabilities, due to the current lack of many analogous AI models such as the Faculty of Judgement. The potential and risks of LLMs when they are augmented with other AI technologies are also evaluated. The results indicate that although LLMs have achieved proficiency in some reasoning abilities, the aspiration to match or exceed human intellectual capabilities is yet unattained. This research not only enriches our comprehension of LLMs but also propels forward the discourse on AI's potential and its bounds, paving the way for future explorations into AI's evolving landscape.
- [66] arXiv:2403.03293 [ pdf , ps , html , other ]
-
Title: AI Insights: A Case Study on Utilizing ChatGPT Intelligence for Research Paper AnalysisAnjalee De Silva , Janaka L. Wijekoon , Rashini Liyanarachchi , Rrubaa Panchendrarajan , Weranga RajapakshaSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper discusses the effectiveness of leveraging Chatbot: Generative Pre-trained Transformer (ChatGPT) versions 3.5 and 4 for analyzing research papers for effective writing of scientific literature surveys. The study selected the \textit{Application of Artificial Intelligence in Breast Cancer Treatment} as the research topic. Research papers related to this topic were collected from three major publication databases Google Scholar, Pubmed, and Scopus. ChatGPT models were used to identify the category, scope, and relevant information from the research papers for automatic identification of relevant papers related to Breast Cancer Treatment (BCT), organization of papers according to scope, and identification of key information for survey paper writing. Evaluations performed using ground truth data annotated using subject experts reveal, that GPT-4 achieves 77.3\% accuracy in identifying the research paper categories and 50\% of the papers were correctly identified by GPT-4 for their scopes. Further, the results demonstrate that GPT-4 can generate reasons for its decisions with an average of 27\% new words, and 67\% of the reasons given by the model were completely agreeable to the subject experts.
- [67] arXiv:2403.03357 [ pdf , ps , html , other ]
-
Title: The Case for Globalizing Fairness: A Mixed Methods Study on Colonialism, AI, and Health in AfricaMercy Asiedu , Awa Dieng , Iskandar Haykel , Negar Rostamzadeh , Stephen Pfohl , Chirag Nagpal , Maria Nagawa , Abigail Oppong , Sanmi Koyejo , Katherine HellerComments: 11 pages, 4 figures. arXiv admin note: text overlap with arXiv:2304.02190Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: With growing application of machine learning (ML) technologies in healthcare, there have been calls for developing techniques to understand and mitigate biases these systems may exhibit. Fair-ness considerations in the development of ML-based solutions for health have particular implications for Africa, which already faces inequitable power imbalances between the Global North and South.This paper seeks to explore fairness for global health, with Africa as a case study. We conduct a scoping review to propose axes of disparities for fairness consideration in the African context and delineate where they may come into play in different ML-enabled medical modalities. We then conduct qualitative research studies with 672 general population study participants and 28 experts inML, health, and policy focused on Africa to obtain corroborative evidence on the proposed axes of disparities. Our analysis focuses on colonialism as the attribute of interest and examines the interplay between artificial intelligence (AI), health, and colonialism. Among the pre-identified attributes, we found that colonial history, country of origin, and national income level were specific axes of disparities that participants believed would cause an AI system to be biased.However, there was also divergence of opinion between experts and general population participants. Whereas experts generally expressed a shared view about the relevance of colonial history for the development and implementation of AI technologies in Africa, the majority of the general population participants surveyed did not think there was a direct link between AI and colonialism. Based on these findings, we provide practical recommendations for developing fairness-aware ML solutions for health in Africa.
- [68] arXiv:2403.03359 [ pdf , ps , html , other ]
-
Title: RACE-SM: Reinforcement Learning Based Autonomous Control for Social On-Ramp MergingComments: Updated explanation of TTC, page 7Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Autonomous parallel-style on-ramp merging in human controlled traffic continues to be an existing issue for autonomous vehicle control. Existing non-learning based solutions for vehicle control rely on rules and optimization primarily. These methods have been seen to present significant challenges. Recent advancements in Deep Reinforcement Learning have shown promise and have received significant academic interest however the available learning based approaches show inadequate attention to other highway vehicles and often rely on inaccurate road traffic assumptions. In addition, the parallel-style case is rarely considered. A novel learning based model for acceleration and lane change decision making that explicitly considers the utility to both the ego vehicle and its surrounding vehicles which may be cooperative or uncooperative to produce behaviour that is socially acceptable is proposed. The novel reward function makes use of Social Value Orientation to weight the vehicle's level of social cooperation and is divided into ego vehicle and surrounding vehicle utility which are weighted according to the model's designated Social Value Orientation. A two-lane highway with an on-ramp divided into a taper-style and parallel-style section is considered. Simulation results indicated the importance of considering surrounding vehicles in reward function design and show that the proposed model matches or surpasses those in literature in terms of collisions while also introducing socially courteous behaviour avoiding near misses and anti-social behaviour through direct consideration of the effect of merging on surrounding vehicles.
- [69] arXiv:2403.03382 [ pdf , ps , html , other ]
-
Title: Adaptive Discovering and Merging for Incremental Novel Class DiscoveryComments: AAAI 2024. arXiv admin note: text overlap with arXiv:2207.08605 by other authorsSubjects: Artificial Intelligence (cs.AI)
Abstract: One important desideratum of lifelong learning aims to discover novel classes from unlabelled data in a continuous manner. The central challenge is twofold: discovering and learning novel classes while mitigating the issue of catastrophic forgetting of established knowledge. To this end, we introduce a new paradigm called Adaptive Discovering and Merging (ADM) to discover novel categories adaptively in the incremental stage and integrate novel knowledge into the model without affecting the original knowledge. To discover novel classes adaptively, we decouple representation learning and novel class discovery, and use Triple Comparison (TC) and Probability Regularization (PR) to constrain the probability discrepancy and diversity for adaptive category assignment. To merge the learned novel knowledge adaptively, we propose a hybrid structure with base and novel branches named Adaptive Model Merging (AMM), which reduces the interference of the novel branch on the old classes to preserve the previous knowledge, and merges the novel branch to the base model without performance loss and parameter growth. Extensive experiments on several datasets show that ADM significantly outperforms existing class-incremental Novel Class Discovery (class-iNCD) approaches. Moreover, our AMM also benefits the class-incremental Learning (class-IL) task by alleviating the catastrophic forgetting problem.
- [70] arXiv:2403.03401 [ pdf , ps , html , other ]
-
Title: BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-ProvingSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Abstract: Artificial Intelligence for Theorem Proving has given rise to a plethora of benchmarks and methodologies, particularly in Interactive Theorem Proving (ITP). Research in the area is fragmented, with a diverse set of approaches being spread across several ITP systems. This presents a significant challenge to the comparison of methods, which are often complex and difficult to replicate. Addressing this, we present BAIT, a framework for fair and streamlined comparison of learning approaches in ITP. We demonstrate BAIT's capabilities with an in-depth comparison, across several ITP benchmarks, of state-of-the-art architectures applied to the problem of formula embedding. We find that Structure Aware Transformers perform particularly well, improving on techniques associated with the original problem sets. BAIT also allows us to assess the end-to-end proving performance of systems built on interactive environments. This unified perspective reveals a novel end-to-end system that improves on prior work. We also provide a qualitative analysis, illustrating that improved performance is associated with more semantically-aware embeddings. By streamlining the implementation and comparison of Machine Learning algorithms in the ITP context, we anticipate BAIT will be a springboard for future research.
- [71] arXiv:2403.03406 [ pdf , ps , other ]
-
Title: An EnKF-LSTM Assimilation Algorithm for Crop Growth ModelSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Accurate and timely prediction of crop growth is of great significance to ensure crop yields and researchers have developed several crop models for the prediction of crop growth. However, there are large difference between the simulation results obtained by the crop models and the actual results, thus in this paper, we proposed to combine the simulation results with the collected crop data for data assimilation so that the accuracy of prediction will be improved. In this paper, an EnKF-LSTM data assimilation method for various crops is proposed by combining ensemble Kalman filter and LSTM neural network, which effectively avoids the overfitting problem of existing data assimilation methods and eliminates the uncertainty of the measured data. The verification of the proposed EnKF-LSTM method and the comparison of the proposed method with other data assimilation methods were performed using datasets collected by sensor equipment deployed on a farm.
- [72] arXiv:2403.03517 [ pdf , ps , html , other ]
-
Title: IB-Net: Initial Branch Network for Variable Decision in Boolean SatisfiabilityComments: 7 pages, 12 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Boolean Satisfiability problems are vital components in Electronic Design Automation, particularly within the Logic Equivalence Checking process. Currently, SAT solvers are employed for these problems and neural network is tried as assistance to solvers. However, as SAT problems in the LEC context are distinctive due to their predominantly unsatisfiability nature and a substantial proportion of UNSAT-core variables, existing neural network assistance has proven unsuccessful in this specialized domain. To tackle this challenge, we propose IB-Net, an innovative framework utilizing graph neural networks and novel graph encoding techniques to model unsatisfiable problems and interact with state-of-the-art solvers. Extensive evaluations across solvers and datasets demonstrate IB-Net's acceleration, achieving an average runtime speedup of 5.0% on industrial data and 8.3% on SAT competition data empirically. This breakthrough advances efficient solving in LEC workflows.
- [73] arXiv:2403.03544 [ pdf , ps , html , other ]
-
Title: Prompt Mining for Language-based Human Mobility ForecastingSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: With the advancement of large language models, language-based forecasting has recently emerged as an innovative approach for predicting human mobility patterns. The core idea is to use prompts to transform the raw mobility data given as numerical values into natural language sentences so that the language models can be leveraged to generate the description for future observations. However, previous studies have only employed fixed and manually designed templates to transform numerical values into sentences. Since the forecasting performance of language models heavily relies on prompts, using fixed templates for prompting may limit the forecasting capability of language models. In this paper, we propose a novel framework for prompt mining in language-based mobility forecasting, aiming to explore diverse prompt design strategies. Specifically, the framework includes a prompt generation stage based on the information entropy of prompts and a prompt refinement stage to integrate mechanisms such as the chain of thought. Experimental results on real-world large-scale data demonstrate the superiority of generated prompts from our prompt mining pipeline. Additionally, the comparison of different prompt variants shows that the proposed prompt refinement process is effective. Our study presents a promising direction for further advancing language-based mobility forecasting.
- [74] arXiv:2403.03550 [ pdf , ps , other ]
-
Title: Emotional Manipulation Through Prompt Engineering Amplifies Disinformation Generation in AI Large Language ModelsComments: 14 pages, 3 figuresSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Abstract: This study investigates the generation of synthetic disinformation by OpenAI's Large Language Models (LLMs) through prompt engineering and explores their responsiveness to emotional prompting. Leveraging various LLM iterations using davinci-002, davinci-003, gpt-3.5-turbo and gpt-4, we designed experiments to assess their success in producing disinformation. Our findings, based on a corpus of 19,800 synthetic disinformation social media posts, reveal that all LLMs by OpenAI can successfully produce disinformation, and that they effectively respond to emotional prompting, indicating their nuanced understanding of emotional cues in text generation. When prompted politely, all examined LLMs consistently generate disinformation at a high frequency. Conversely, when prompted impolitely, the frequency of disinformation production diminishes, as the models often refuse to generate disinformation and instead caution users that the tool is not intended for such purposes. This research contributes to the ongoing discourse surrounding responsible development and application of AI technologies, particularly in mitigating the spread of disinformation and promoting transparency in AI-generated content.
- [75] arXiv:2403.03594 [ pdf , ps , html , other ]
-
Title: Assessing the Aesthetic Evaluation Capabilities of GPT-4 with Vision: Insights from Group and Individual AssessmentsComments: 8 pages, 6 figures, submitted to The 38th Annual Conference of the Japanese Society for Artificial Intelligence, 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Recently, it has been recognized that large language models demonstrate high performance on various intellectual tasks. However, few studies have investigated alignment with humans in behaviors that involve sensibility, such as aesthetic evaluation. This study investigates the performance of GPT-4 with Vision, a state-of-the-art language model that can handle image input, on the task of aesthetic evaluation of images. We employ two tasks, prediction of the average evaluation values of a group and an individual's evaluation values. We investigate the performance of GPT-4 with Vision by exploring prompts and analyzing prediction behaviors. Experimental results reveal GPT-4 with Vision's superior performance in predicting aesthetic evaluations and the nature of different responses to beauty and ugliness. Finally, we discuss developing an AI system for aesthetic evaluation based on scientific knowledge of the human perception of beauty, employing agent technologies that integrate traditional deep learning models with large language models.
- [76] arXiv:2403.03600 [ pdf , ps , html , other ]
-
Title: A Privacy-Preserving Framework with Multi-Modal Data for Cross-Domain RecommendationSubjects: Artificial Intelligence (cs.AI)
Abstract: Cross-domain recommendation (CDR) aims to enhance recommendation accuracy in a target domain with sparse data by leveraging rich information in a source domain, thereby addressing the data-sparsity problem. Some existing CDR methods highlight the advantages of extracting domain-common and domain-specific features to learn comprehensive user and item representations. However, these methods can't effectively disentangle these components as they often rely on simple user-item historical interaction information (such as ratings, clicks, and browsing), neglecting the rich multi-modal features. Additionally, they don't protect user-sensitive data from potential leakage during knowledge transfer between domains. To address these challenges, we propose a Privacy-Preserving Framework with Multi-Modal Data for Cross-Domain Recommendation, called P2M2-CDR. Specifically, we first design a multi-modal disentangled encoder that utilizes multi-modal information to disentangle more informative domain-common and domain-specific embeddings. Furthermore, we introduce a privacy-preserving decoder to mitigate user privacy leakage during knowledge transfer. Local differential privacy (LDP) is utilized to obfuscate the disentangled embeddings before inter-domain exchange, thereby enhancing privacy protection. To ensure both consistency and differentiation among these obfuscated disentangled embeddings, we incorporate contrastive learning-based domain-inter and domain-intra losses. Extensive Experiments conducted on four real-world datasets demonstrate that P2M2-CDR outperforms other state-of-the-art single-domain and cross-domain baselines.
- [77] arXiv:2403.03607 [ pdf , ps , other ]
-
Title: The Geometric Structure of Topic ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: Topic models are a popular tool for clustering and analyzing textual data. They allow texts to be classified on the basis of their affiliation to the previously calculated topics. Despite their widespread use in research and application, an in-depth analysis of topic models is still an open research topic. State-of-the-art methods for interpreting topic models are based on simple visualizations, such as similarity matrices, top-term lists or embeddings, which are limited to a maximum of three dimensions. In this paper, we propose an incidence-geometric method for deriving an ordinal structure from flat topic models, such as non-negative matrix factorization. These enable the analysis of the topic model in a higher (order) dimension and the possibility of extracting conceptual relationships between several topics at once. Due to the use of conceptual scaling, our approach does not introduce any artificial topical relationships, such as artifacts of feature compression. Based on our findings, we present a new visualization paradigm for concept hierarchies based on ordinal motifs. These allow for a top-down view on topic spaces. We introduce and demonstrate the applicability of our approach based on a topic model derived from a corpus of scientific papers taken from 32 top machine learning venues.
- [78] arXiv:2403.03636 [ pdf , ps , other ]
-
Title: SheetAgent: A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language ModelsComments: 24 pages, 14 figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Spreadsheet manipulation is widely existing in most daily works and significantly improves working efficiency. Large language model (LLM) has been recently attempted for automatic spreadsheet manipulation but has not yet been investigated in complicated and realistic tasks where reasoning challenges exist (e.g., long horizon manipulation with multi-step reasoning and ambiguous requirements). To bridge the gap with the real-world requirements, we introduce $\textbf{SheetRM}$, a benchmark featuring long-horizon and multi-category tasks with reasoning-dependent manipulation caused by real-life challenges. To mitigate the above challenges, we further propose $\textbf{SheetAgent}$, a novel autonomous agent that utilizes the power of LLMs. SheetAgent consists of three collaborative modules: $\textit{Planner}$, $\textit{Informer}$, and $\textit{Retriever}$, achieving both advanced reasoning and accurate manipulation over spreadsheets without human interaction through iterative task reasoning and reflection. Extensive experiments demonstrate that SheetAgent delivers 20-30% pass rate improvements on multiple benchmarks over baselines, achieving enhanced precision in spreadsheet manipulation and demonstrating superior table reasoning abilities. More details and visualizations are available at this https URL .
- [79] arXiv:2403.03645 [ pdf , ps , html , other ]
-
Title: K-Link: Knowledge-Link Graph from LLMs for Enhanced Representation Learning in Multivariate Time-Series DataComments: 12 pages,7 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Sourced from various sensors and organized chronologically, Multivariate Time-Series (MTS) data involves crucial spatial-temporal dependencies, e.g., correlations among sensors. To capture these dependencies, Graph Neural Networks (GNNs) have emerged as powerful tools, yet their effectiveness is restricted by the quality of graph construction from MTS data. Typically, existing approaches construct graphs solely from MTS signals, which may introduce bias due to a small training dataset and may not accurately represent underlying dependencies. To address this challenge, we propose a novel framework named K-Link, leveraging Large Language Models (LLMs) to encode extensive general knowledge and thereby providing effective solutions to reduce the bias. Leveraging the knowledge embedded in LLMs, such as physical principles, we extract a \textit{Knowledge-Link graph}, capturing vast semantic knowledge of sensors and the linkage of the sensor-level knowledge. To harness the potential of the knowledge-link graph in enhancing the graph derived from MTS data, we propose a graph alignment module, facilitating the transfer of semantic knowledge within the knowledge-link graph into the MTS-derived graph. By doing so, we can improve the graph quality, ensuring effective representation learning with GNNs for MTS data. Extensive experiments demonstrate the efficacy of our approach for superior performance across various MTS-related downstream tasks.
- [80] arXiv:2403.03744 [ pdf , ps , html , other ]
-
Title: Towards Safe Large Language Models for MedicineSubjects: Artificial Intelligence (cs.AI)
Abstract: As large language models (LLMs) develop ever-improving capabilities and are applied in real-world settings, it is important to understand their safety. While initial steps have been taken to evaluate the safety of general-knowledge LLMs, exposing some weaknesses, the safety of medical LLMs has not been sufficiently evaluated despite their high risks to personal health and safety, public health and safety, patient rights, and human rights. To address this gap, we conduct, to our knowledge, the first study of its kind to evaluate and improve the safety of medical LLMs. We find that 1) current medical LLMs do not meet standards of general or medical safety, as they readily comply with harmful requests and that 2) fine-tuning medical LLMs on safety demonstrations significantly improves their safety, reducing their tendency to comply with harmful requests. In addition, we present a definition of medical safety for LLMs and develop a benchmark dataset to evaluate and train for medical safety in LLMs. Poised at the intersection of research on machine learning safety and medical machine learning, this work casts light on the status quo of the safety of medical LLMs and motivates future work in this area, mitigating the risks of harm of LLMs in medicine.
- [81] arXiv:2403.03768 [ pdf , ps , html , other ]
-
Title: DeepCRE: Transforming Drug R&D via AI-Driven Cross-drug Response EvaluationYushuai Wu , Ting Zhang , Hao Zhou , Hainan Wu , Hanwen Sunchu , Lei Hu , Xiaofang Chen , Suyuan Zhao , Gaochao Liu , Chao Sun , Jiahuan Zhang , Yizhen Luo , Peng Liu , Zaiqing Nie , Yushuai WuSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Abstract: The fields of therapeutic application and drug research and development (R&D) both face substantial challenges, i.e., the therapeutic domain calls for more treatment alternatives, while numerous promising pre-clinical drugs have failed in clinical trials. One of the reasons is the inadequacy of Cross-drug Response Evaluation (CRE) during the late stages of drug R&D. Although in-silico CRE models bring a promising solution, existing methodologies are restricted to early stages of drug R&D, such as target and cell-line levels, offering limited improvement to clinical success rates. Herein, we introduce DeepCRE, a pioneering AI model designed to predict CRE effectively in the late stages of drug R&D. DeepCRE outperforms the existing best models by achieving an average performance improvement of 17.7% in patient-level CRE, and a 5-fold increase in indication-level CRE, facilitating more accurate personalized treatment predictions and better pharmaceutical value assessment for indications, respectively. Furthermore, DeepCRE has identified a set of six drug candidates that show significantly greater effectiveness than a comparator set of two approved drugs in 5/8 colorectal cancer organoids. This demonstrates the capability of DeepCRE to systematically uncover a spectrum of drug candidates with enhanced therapeutic effects, highlighting its potential to transform drug R&D.
- [82] arXiv:2403.03828 [ pdf , ps , other ]
-
Title: From Clicks to Security: Investigating Continuous Authentication via Mouse DynamicsSubjects: Artificial Intelligence (cs.AI)
Abstract: In the realm of computer security, the importance of efficient and reliable user authentication methods has become increasingly critical. This paper examines the potential of mouse movement dynamics as a consistent metric for continuous authentication. By analyzing user mouse movement patterns in two contrasting gaming scenarios, "Team Fortress" and Poly Bridge we investigate the distinctive behavioral patterns inherent in high-intensity and low-intensity UI interactions. The study extends beyond conventional methodologies by employing a range of machine learning models. These models are carefully selected to assess their effectiveness in capturing and interpreting the subtleties of user behavior as reflected in their mouse movements. This multifaceted approach allows for a more nuanced and comprehensive understanding of user interaction patterns. Our findings reveal that mouse movement dynamics can serve as a reliable indicator for continuous user authentication. The diverse machine learning models employed in this study demonstrate competent performance in user verification, marking an improvement over previous methods used in this field. This research contributes to the ongoing efforts to enhance computer security and highlights the potential of leveraging user behavior, specifically mouse dynamics, in developing robust authentication systems.
- [83] arXiv:2403.03832 [ pdf , ps , other ]
-
Title: Your device may know you better than you know yourself -- continuous authentication on novel dataset using machine learningSubjects: Artificial Intelligence (cs.AI)
Abstract: This research aims to further understanding in the field of continuous authentication using behavioral biometrics. We are contributing a novel dataset that encompasses the gesture data of 15 users playing Minecraft with a Samsung Tablet, each for a duration of 15 minutes. Utilizing this dataset, we employed machine learning (ML) binary classifiers, being Random Forest (RF), K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC), to determine the authenticity of specific user actions. Our most robust model was SVC, which achieved an average accuracy of approximately 90%, demonstrating that touch dynamics can effectively distinguish users. However, further studies are needed to make it viable option for authentication systems
- [84] arXiv:2403.03894 [ pdf , ps , html , other ]
-
Title: IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code GeneratorsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Programming Languages (cs.PL)
Abstract: Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer.
To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following. - [85] arXiv:2403.03920 [ pdf , ps , html , other ]
-
Title: Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational ArtifactsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: This paper explores the transformative potential of computer-assisted textual analysis in enhancing instructional quality through in-depth insights from educational artifacts. We integrate Richard Elmore's Instructional Core Framework to examine how artificial intelligence (AI) and machine learning (ML) methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement. Through a comprehensive review and case studies within the Instructional Core Framework, we identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development. We unveil patterns that indicate AI/ML not only streamlines administrative tasks but also introduces novel pathways for personalized learning, providing actionable feedback for educators and contributing to a richer understanding of instructional dynamics. This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings, advocating for a balanced approach that considers ethical considerations, data quality, and the integration of human expertise.
- [86] arXiv:2403.03996 [ pdf , ps , other ]
-
Title: Rethinking Urban Flood Risk Assessment By Adapting Health Domain PerspectiveSubjects: Artificial Intelligence (cs.AI)
Abstract: Inspired by ideas from health risk assessment, this paper presents a new perspective for flood risk assessment. The proposed perspective focuses on three pillars for examining flood risk: (1) inherent susceptibility, (2) mitigation strategies, and (3) external stressors. These pillars collectively encompass the physical and environmental characteristics of urban areas, the effectiveness of human-intervention measures, and the influence of uncontrollable external factors, offering a fresh point of view for decoding flood risks. For each pillar, we delineate its individual contributions to flood risk and illustrate their interactive and overall impact. The three-pillars model embodies a shift in focus from the quest to precisely model and quantify flood risk to evaluating pathways to high flood risk. The shift in perspective is intended to alleviate the quest for quantifying and predicting flood risk at fine resolutions as a panacea for enhanced flood risk management. The decomposition of flood risk pathways into the three intertwined pillars (i.e., inherent factors, mitigation factors, and external factors) enables evaluation of changes in factors within each pillar enhance and exacerbate flood risk, creating a platform from which to inform plans, decisions, and actions. Building on this foundation, we argue that a flood risk pathway analysis approach, which examines the individual and collective impacts of inherent factors, mitigation strategies, and external stressors, is essential for a nuanced evaluation of flood risk. Accordingly, the proposed perspective could complement the existing frameworks and approaches for flood risk assessment.
- [87] arXiv:2403.03997 [ pdf , ps , html , other ]
-
Title: Guiding Enumerative Program Synthesis with Large Language ModelsComments: 27 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: Pre-trained Large Language Models (LLMs) are beginning to dominate the discourse around automatic code generation with natural language specifications. In contrast, the best-performing synthesizers in the domain of formal synthesis with precise logical specifications are still based on enumerative algorithms. In this paper, we evaluate the abilities of LLMs to solve formal synthesis benchmarks by carefully crafting a library of prompts for the domain. When one-shot synthesis fails, we propose a novel enumerative synthesis algorithm, which integrates calls to an LLM into a weighted probabilistic search. This allows the synthesizer to provide the LLM with information about the progress of the enumerator, and the LLM to provide the enumerator with syntactic guidance in an iterative loop. We evaluate our techniques on benchmarks from the Syntax-Guided Synthesis (SyGuS) competition. We find that GPT-3.5 as a stand-alone tool for formal synthesis is easily outperformed by state-of-the-art formal synthesis algorithms, but our approach integrating the LLM into an enumerative synthesis algorithm shows significant performance gains over both the LLM and the enumerative synthesizer alone and the winning SyGuS competition tool.
- [88] arXiv:2403.04017 [ pdf , ps , html , other ]
-
Title: Learning Guided Automated Reasoning: A Brief SurveyLasse Blaauwbroek , David Cerna , Thibault Gauthier , Jan Jakubův , Cezary Kaliszyk , Martin Suda , Josef UrbanSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
Abstract: Automated theorem provers and formal proof assistants are general reasoning systems that are in theory capable of proving arbitrarily hard theorems, thus solving arbitrary problems reducible to mathematics and logical reasoning. In practice, such systems however face large combinatorial explosion, and therefore include many heuristics and choice points that considerably influence their performance. This is an opportunity for trained machine learning predictors, which can guide the work of such reasoning systems. Conversely, deductive search supported by the notion of logically valid proof allows one to train machine learning systems on large reasoning corpora. Such bodies of proof are usually correct by construction and when combined with more and more precise trained guidance they can be boostrapped into very large corpora, with increasingly long reasoning chains and possibly novel proof ideas. In this paper we provide an overview of several automated reasoning and theorem proving domains and the learning and AI methods that have been so far developed for them. These include premise selection, proof guidance in several settings, AI systems and feedback loops iterating between reasoning and learning, and symbolic classification problems.
- [89] arXiv:2403.04035 [ pdf , ps , html , other ]
-
Title: Personalizing explanations of AI-driven hints to users' cognitive abilities: an empirical evaluationSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Abstract: We investigate personalizing the explanations that an Intelligent Tutoring System generates to justify the hints it provides to students to foster their learning. The personalization targets students with low levels of two traits, Need for Cognition and Conscientiousness, and aims to enhance these students' engagement with the explanations, based on prior findings that these students do not naturally engage with the explanations but they would benefit from them if they do. To evaluate the effectiveness of the personalization, we conducted a user study where we found that our proposed personalization significantly increases our target users' interaction with the hint explanations, their understanding of the hints and their learning. Hence, this work provides valuable insights into effectively personalizing AI-driven explanations for cognitively demanding tasks such as learning.
- [90] arXiv:2403.04072 [ pdf , ps , html , other ]
-
Title: Forecasting and Mitigating Disruptions in Public Bus Transit ServicesSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Public transportation systems often suffer from unexpected fluctuations in demand and disruptions, such as mechanical failures and medical emergencies. These fluctuations and disruptions lead to delays and overcrowding, which are detrimental to the passengers' experience and to the overall performance of the transit service. To proactively mitigate such events, many transit agencies station substitute (reserve) vehicles throughout their service areas, which they can dispatch to augment or replace vehicles on routes that suffer overcrowding or disruption. However, determining the optimal locations where substitute vehicles should be stationed is a challenging problem due to the inherent randomness of disruptions and due to the combinatorial nature of selecting locations across a city. In collaboration with the transit agency of Nashville, TN, we address this problem by introducing data-driven statistical and machine-learning models for forecasting disruptions and an effective randomized local-search algorithm for selecting locations where substitute vehicles are to be stationed. Our research demonstrates promising results in proactive disruption management, offering a practical and easily implementable solution for transit agencies to enhance the reliability of their services. Our results resonate beyond mere operational efficiency: by advancing proactive strategies, our approach fosters more resilient and accessible public transportation, contributing to equitable urban mobility and ultimately benefiting the communities that rely on public transportation the most.
- [91] arXiv:2403.04087 [ pdf , ps , html , other ]
-
Title: The Cognitive Type Project -- Mapping Typography to CognitionSubjects: Artificial Intelligence (cs.AI)
Abstract: The Cognitive Type Project is focused on developing computational tools to enable the design of typefaces with varying cognitive properties. This initiative aims to empower typographers to craft fonts that enhance click-through rates for online ads, improve reading levels in children's books, enable dyslexics to create personalized type, or provide insights into customer reactions to textual content in media. A significant challenge in research related to mapping typography to cognition is the creation of thousands of typefaces with minor variations, a process that is both labor-intensive and requires the expertise of skilled typographers. Cognitive science research highlights that the design and form of letters, along with the text's overall layout, are crucial in determining the ease of reading and other cognitive properties of type such as perceived beauty and memorability. These factors affect not only the legibility and clarity of information presentation but also the likability of a typeface.
- [92] arXiv:2403.04105 [ pdf , ps , html , other ]
-
Title: Artificial Intelligence Exploring the Patent FieldComments: 53 pages, 14 figures, 5 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: Advanced language-processing and machine-learning techniques promise massive efficiency improvements in the previously widely manual field of patent and technical knowledge management. This field presents large-scale and complex data with very precise contents and language representation of those contents. Particularly, patent texts can differ from mundane texts in various aspects, which entails significant opportunities and challenges. This paper presents a systematic overview of patent-related tasks and popular methodologies with a special focus on evolving and promising techniques. Language processing and particularly large language models as well as the recent boost of general generative methods promise to become game changers in the patent field. The patent literature and the fact-based argumentative procedures around patents appear almost as an ideal use case. However, patents entail a number of difficulties with which existing models struggle. The paper introduces fundamental aspects of patents and patent-related data that affect technology that wants to explore or manage them. It further reviews existing methods and approaches and points out how important reliable and unbiased evaluation metrics become. Although research has made substantial progress on certain tasks, the performance across many others remains suboptimal, sometimes because of either the special nature of patents and their language or inconsistencies between legal terms and the everyday meaning of terms. Moreover, yet few methods have demonstrated the ability to produce satisfactory text for specific sections of patents. By pointing out key developments, opportunities, and gaps, we aim to encourage further research and accelerate the advancement of this field.
- [93] arXiv:2403.04106 [ pdf , ps , other ]
-
Title: Understanding Biology in the Age of Artificial IntelligenceElsa Lawrence , Adham El-Shazly , Srijit Seal , Chaitanya K Joshi , Pietro Liò , Shantanu Singh , Andreas Bender , Pietro Sormanni , Matthew GreenigSubjects: Artificial Intelligence (cs.AI)
Abstract: Modern life sciences research is increasingly relying on artificial intelligence approaches to model biological systems, primarily centered around the use of machine learning (ML) models. Although ML is undeniably useful for identifying patterns in large, complex data sets, its widespread application in biological sciences represents a significant deviation from traditional methods of scientific inquiry. As such, the interplay between these models and scientific understanding in biology is a topic with important implications for the future of scientific research, yet it is a subject that has received little attention. Here, we draw from an epistemological toolkit to contextualize recent applications of ML in biological sciences under modern philosophical theories of understanding, identifying general principles that can guide the design and application of ML systems to model biological phenomena and advance scientific knowledge. We propose that conceptions of scientific understanding as information compression, qualitative intelligibility, and dependency relation modelling provide a useful framework for interpreting ML-mediated understanding of biological systems. Through a detailed analysis of two key application areas of ML in modern biological research - protein structure prediction and single cell RNA-sequencing - we explore how these features have thus far enabled ML systems to advance scientific understanding of their target phenomena, how they may guide the development of future ML models, and the key obstacles that remain in preventing ML from achieving its potential as a tool for biological discovery. Consideration of the epistemological features of ML applications in biology will improve the prospects of these methods to solve important problems and advance scientific understanding of living systems.
- [94] arXiv:2403.04121 [ pdf , ps , html , other ]
-
Title: Can Large Language Models Reason and Plan?Comments: arXiv admin note: text overlap with arXiv:2402.01817 (v2 add creative commons attribution to Figure 2 graphic)Journal-ref: Annals of The New York Academy of Sciences; March 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: While humans sometimes do show the capability of correcting their own erroneous guesses with self-critiquing, there seems to be no basis for that assumption in the case of LLMs.
- [95] arXiv:2403.04124 [ pdf , ps , html , other ]
-
Title: Privacy-preserving Fine-tuning of Large Language Models through FlatnessComments: Accepted to ICLR 2024 SeT LLM WorkshopSubjects: Artificial Intelligence (cs.AI)
Abstract: The privacy concerns associated with the use of Large Language Models (LLMs) have grown recently with the development of LLMs such as ChatGPT. Differential Privacy (DP) techniques are explored in existing work to mitigate their privacy risks at the cost of generalization degradation. Our paper reveals that the flatness of DP-trained models' loss landscape plays an essential role in the trade-off between their privacy and generalization. We further propose a holistic framework to enforce appropriate weight flatness, which substantially improves model generalization with competitive privacy preservation. It innovates from three coarse-to-grained levels, including perturbation-aware min-max optimization on model weights within a layer, flatness-guided sparse prefix-tuning on weights across layers, and weight knowledge distillation between DP \& non-DP weights copies. Comprehensive experiments of both black-box and white-box scenarios are conducted to demonstrate the effectiveness of our proposal in enhancing generalization and maintaining DP characteristics. For instance, on text classification dataset QNLI, DP-Flat achieves similar performance with non-private full fine-tuning but with DP guarantee under privacy budget $\epsilon=3$, and even better performance given higher privacy budgets. Codes are provided in the supplement.
- [96] arXiv:2403.04132 [ pdf , ps , html , other ]
-
Title: Chatbot Arena: An Open Platform for Evaluating LLMs by Human PreferenceWei-Lin Chiang , Lianmin Zheng , Ying Sheng , Anastasios Nikolas Angelopoulos , Tianle Li , Dacheng Li , Hao Zhang , Banghua Zhu , Michael Jordan , Joseph E. Gonzalez , Ion StoicaSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{ this https URL }.
- [97] arXiv:2403.04135 [ pdf , ps , html , other ]
-
Title: Unsupervised Learning of Harmonic Analysis Based on Neural HSMM with Code Quality TemplatesComments: 20 pages, 5 figures, the original edition of this paper will be published in the ICNMC2024 Proceedings and this arXiv publication is a copySubjects: Artificial Intelligence (cs.AI)
Abstract: This paper presents a method of unsupervised learning of harmonic analysis based on a hidden semi-Markov model (HSMM). We introduce the chord quality templates, which specify the probability of pitch class emissions given a root note and a chord quality. Other probability distributions that comprise the HSMM are automatically learned via unsupervised learning, which has been a challenge in existing research. The results of the harmonic analysis of the proposed model were evaluated using existing labeled data. While our proposed method has yet to perform as well as existing models that used supervised learning and complex rule design, it has the advantage of not requiring expensive labeled data or rule elaboration. Furthermore, we also show how to recognize the tonic without prior knowledge, based on the transition probabilities of the Markov model.
- [98] arXiv:2403.04140 [ pdf , ps , html , other ]
-
Title: Contrastive Augmented Graph2Graph Memory Interaction for Few Shot Continual LearningComments: 12 Pages, 5 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Few-Shot Class-Incremental Learning (FSCIL) has gained considerable attention in recent years for its pivotal role in addressing continuously arriving classes. However, it encounters additional challenges. The scarcity of samples in new sessions intensifies overfitting, causing incompatibility between the output features of new and old classes, thereby escalating catastrophic forgetting. A prevalent strategy involves mitigating catastrophic forgetting through the Explicit Memory (EM), which comprise of class prototypes. However, current EM-based methods retrieves memory globally by performing Vector-to-Vector (V2V) interaction between features corresponding to the input and prototypes stored in EM, neglecting the geometric structure of local features. This hinders the accurate modeling of their positional relationships. To incorporate information of local geometric structure, we extend the V2V interaction to Graph-to-Graph (G2G) interaction. For enhancing local structures for better G2G alignment and the prevention of local feature collapse, we propose the Local Graph Preservation (LGP) mechanism. Additionally, to address sample scarcity in classes from new sessions, the Contrast-Augmented G2G (CAG2G) is introduced to promote the aggregation of same class features thus helps few-shot learning. Extensive comparisons on CIFAR100, CUB200, and the challenging ImageNet-R dataset demonstrate the superiority of our method over existing methods.
- [99] arXiv:2403.04204 [ pdf , ps , html , other ]
-
Title: On the Essence and Prospect: An Investigation of Alignment Approaches for Big ModelsXinpeng Wang , Shitong Duan , Xiaoyuan Yi , Jing Yao , Shanlin Zhou , Zhihua Wei , Peng Zhang , Dongkuan Xu , Maosong Sun , Xing XieComments: 23 pages, 7 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Big models have achieved revolutionary breakthroughs in the field of AI, but they might also pose potential concerns. Addressing such concerns, alignment technologies were introduced to make these models conform to human preferences and values. Despite considerable advancements in the past year, various challenges lie in establishing the optimal alignment strategy, such as data cost and scalable oversight, and how to align remains an open question. In this survey paper, we comprehensively investigate value alignment approaches. We first unpack the historical context of alignment tracing back to the 1920s (where it comes from), then delve into the mathematical essence of alignment (what it is), shedding light on the inherent challenges. Following this foundation, we provide a detailed examination of existing alignment methods, which fall into three categories: Reinforcement Learning, Supervised Fine-Tuning, and In-context Learning, and demonstrate their intrinsic connections, strengths, and limitations, helping readers better understand this research area. In addition, two emerging topics, personal alignment, and multimodal alignment, are also discussed as novel frontiers in this field. Looking forward, we discuss potential alignment paradigms and how they could handle remaining challenges, prospecting where future alignment will go.
- [100] arXiv:2403.04261 [ pdf , ps , other ]
-
Title: Advancing Biomedical Text Mining with Community ChallengesHui Zong , Rongrong Wu , Jiaxue Cha , Erman Wu , Jiakun Li , Liang Tao , Zuofeng Li , Buzhou Tang , Bairong ShenSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The field of biomedical research has witnessed a significant increase in the accumulation of vast amounts of textual data from various sources such as scientific literatures, electronic health records, clinical trial reports, and social media. However, manually processing and analyzing these extensive and complex resources is time-consuming and inefficient. To address this challenge, biomedical text mining, also known as biomedical natural language processing, has garnered great attention. Community challenge evaluation competitions have played an important role in promoting technology innovation and interdisciplinary collaboration in biomedical text mining research. These challenges provide platforms for researchers to develop state-of-the-art solutions for data mining and information processing in biomedical research. In this article, we review the recent advances in community challenges specific to Chinese biomedical text mining. Firstly, we collect the information of these evaluation tasks, such as data sources and task types. Secondly, we conduct systematic summary and comparative analysis, including named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Then, we summarize the potential clinical applications of these community challenge tasks from translational informatics perspective. Finally, we discuss the contributions and limitations of these community challenges, while highlighting future directions in the era of large language models.
- [101] arXiv:2403.04264 [ pdf , ps , html , other ]
-
Title: Competitive Facility Location under Random Utilities and Routing ConstraintsSubjects: Artificial Intelligence (cs.AI)
Abstract: In this paper, we study a facility location problem within a competitive market context, where customer demand is predicted by a random utility choice model. Unlike prior research, which primarily focuses on simple constraints such as a cardinality constraint on the number of selected locations, we introduce routing constraints that necessitate the selection of locations in a manner that guarantees the existence of a tour visiting all chosen locations while adhering to a specified tour length upper bound. Such routing constraints find crucial applications in various real-world scenarios. The problem at hand features a non-linear objective function, resulting from the utilization of random utilities, together with complex routing constraints, making it computationally challenging. To tackle this problem, we explore three types of valid cuts, namely, outer-approximation and submodular cuts to handle the nonlinear objective function, as well as sub-tour elimination cuts to address the complex routing constraints. These lead to the development of two exact solution methods: a nested cutting plane and nested branch-and-cut algorithms, where these valid cuts are iteratively added to a master problem through two nested loops. We also prove that our nested cutting plane method always converges to optimality after a finite number of iterations. Furthermore, we develop a local search-based metaheuristic tailored for solving large-scale instances and show its pros and cons compared to exact methods. Extensive experiments are conducted on problem instances of varying sizes, demonstrating that our approach excels in terms of solution quality and computation time when compared to other baseline approaches.
- [102] arXiv:2403.04280 [ pdf , ps , html , other ]
-
Title: A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call DomainQusai Abo Obaidah , Muhy Eddin Za'ter , Adnan Jaljuli , Ali Mahboub , Asma Hakouz , Bashar Alfrou , Yazan EstaitiaSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This work is an attempt to introduce a comprehensive benchmark for Arabic speech recognition, specifically tailored to address the challenges of telephone conversations in Arabic language. Arabic, characterized by its rich dialectal diversity and phonetic complexity, presents a number of unique challenges for automatic speech recognition (ASR) systems. These challenges are further amplified in the domain of telephone calls, where audio quality, background noise, and conversational speech styles negatively affect recognition accuracy. Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communications. By incorporating diverse dialectical expressions and accounting for the variable quality of call recordings, this benchmark seeks to provide a rigorous testing ground for the development and evaluation of ASR systems capable of navigating the complexities of Arabic speech in telephonic contexts. This work also attempts to establish a baseline performance evaluation using state-of-the-art ASR technologies.
- [103] arXiv:2403.04292 [ pdf , ps , other ]
-
Title: A challenge in A(G)I, cybernetics revived in the Ouroboros Model as one algorithm for all thinkingComments: 26 pages, 11 figuresJournal-ref: Artificial Intelligence and Autonomous Systems Volume 1 Issue 1, 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: A topical challenge for algorithms in general and for automatic image categorization and generation in particular is presented in the form of a drawing for AI to understand. In a second vein, AI is challenged to produce something similar from verbal description. The aim of the paper is to highlight strengths and deficiencies of current Artificial Intelligence approaches while coarsely sketching a way forward. A general lack of encompassing symbol-embedding and (not only) -grounding in some bodily basis is made responsible for current deficiencies. A concomitant dearth of hierarchical organization of concepts follows suite. As a remedy for these shortcomings, it is proposed to take a wide step back and to newly incorporate aspects of cybernetics and analog control processes. It is claimed that a promising overarching perspective is provided by the Ouroboros Model with a valid and versatile algorithmic backbone for general cognition at all accessible levels of abstraction and capabilities. Reality, rules, truth, and Free Will are all useful abstractions according to the Ouroboros Model. Logic deduction as well as intuitive guesses are claimed as produced on the basis of one compartmentalized memory for schemata and a pattern-matching, i.e., monitoring process termed consumption analysis. The latter directs attention on short (attention proper) and also on long times scales (emotional biases). In this cybernetic approach, discrepancies between expectations and actual activations (e.g., sensory precepts) drive the general process of cognition and at the same time steer the storage of new and adapted memory entries. Dedicated structures in the human brain work in concert according to this scheme.
- [104] arXiv:2403.04293 [ pdf , ps , html , other ]
-
Title: MKF-ADS: Multi-Knowledge Fusion Based Self-supervised Anomaly Detection System for Control Area NetworkComments: 14 figures, 5 tablesSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR)
Abstract: Control Area Network (CAN) is an essential communication protocol that interacts between Electronic Control Units (ECUs) in the vehicular network. However, CAN is facing stringent security challenges due to innate security risks. Intrusion detection systems (IDSs) are a crucial safety component in remediating Vehicular Electronics and Systems vulnerabilities. However, existing IDSs fail to identify complexity attacks and have higher false alarms owing to capability bottleneck. In this paper, we propose a self-supervised multi-knowledge fused anomaly detection model, called MKF-ADS. Specifically, the method designs an integration framework, including spatial-temporal correlation with an attention mechanism (STcAM) module and patch sparse-transformer module (PatchST). The STcAM with fine-pruning uses one-dimensional convolution (Conv1D) to extract spatial features and subsequently utilizes the Bidirectional Long Short Term Memory (Bi-LSTM) to extract the temporal features, where the attention mechanism will focus on the important time steps. Meanwhile, the PatchST captures the combined contextual features from independent univariate time series. Finally, the proposed method is based on knowledge distillation to STcAM as a student model for learning intrinsic knowledge and cross the ability to mimic PatchST. We conduct extensive experiments on six simulation attack scenarios across various CAN IDs and time steps, and two real attack scenarios, which present a competitive prediction and detection performance. Compared with the baseline in the same paradigm, the error rate and FAR are 2.62\% and 2.41\% and achieve a promising F1-score of 97.3\%.
- [105] arXiv:2403.04311 [ pdf , ps , html , other ]
-
Title: ALTO: An Efficient Network Orchestrator for Compound AI SystemsKeshav Santhanam , Deepti Raghavan , Muhammad Shahir Rahman , Thejas Venkatesh , Neha Kunjal , Pratiksha Thaker , Philip Levis , Matei ZahariaSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Abstract: We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO achieves high throughput and low latency by taking advantage of an optimization opportunity specific to generative language models: streaming intermediate outputs. As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible. We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances. We also motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling to address these challenges. We demonstrate the impact of ALTO's partial output streaming on a complex chatbot verification pipeline, increasing throughput by up to 3x for a fixed latency target of 4 seconds / request while also reducing tail latency by 1.8x compared to a baseline serving approach.
- [106] arXiv:2403.04343 [ pdf , ps , html , other ]
-
Title: CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction TuningSubjects: Artificial Intelligence (cs.AI)
Abstract: Visual instruction tuning is a key training stage of large multimodal models (LMMs). Nevertheless, the common practice of indiscriminately mixing instruction-following data from various tasks may result in suboptimal overall performance due to different instruction formats and knowledge domains across tasks. To mitigate this issue, we propose a novel Comprehensive Task Balancing (CoTBal) algorithm for multi-task visual instruction tuning of LMMs. To our knowledge, this is the first work that explores multi-task optimization in visual instruction tuning. Specifically, we consider two key dimensions for task balancing: (1) Inter-Task Contribution, the phenomenon where learning one task potentially enhances the performance in other tasks, attributable to the overlapping knowledge domains, and (2) Intra-Task Difficulty, which refers to the learning difficulty within a single task. By quantifying these two dimensions with performance-based metrics, task balancing is thus enabled by assigning more weights to tasks that offer substantial contributions to others, receive minimal contributions from others, and also have great intra-task difficulties. Experiments show that our CoTBal leads to superior overall performance in multi-task visual instruction tuning.
- [107] arXiv:2403.04366 [ pdf , ps , html , other ]
-
Title: Enhancing Court View Generation with Knowledge Injection and GuidanceSubjects: Artificial Intelligence (cs.AI)
Abstract: Court View Generation (CVG) is a challenging task in the field of Legal Artificial Intelligence (LegalAI), which aims to generate court views based on the plaintiff claims and the fact descriptions. While Pretrained Language Models (PLMs) have showcased their prowess in natural language generation, their application to the complex, knowledge-intensive domain of CVG often reveals inherent limitations. In this paper, we present a novel approach, named Knowledge Injection and Guidance (KIG), designed to bolster CVG using PLMs. To efficiently incorporate domain knowledge during the training stage, we introduce a knowledge-injected prompt encoder for prompt tuning, thereby reducing computational overhead. Moreover, to further enhance the model's ability to utilize domain knowledge, we employ a generating navigator, which dynamically guides the text generation process in the inference stage without altering the model's architecture, making it readily transferable. Comprehensive experiments on real-world data demonstrate the effectiveness of our approach compared to several established baselines, especially in the responsivity of claims, where it outperforms the best baseline by 11.87%.
- [108] arXiv:2403.04369 [ pdf , ps , html , other ]
-
Title: From Graph to Word Bag: Introducing Domain Knowledge to Confusing Charge PredictionSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Confusing charge prediction is a challenging task in legal AI, which involves predicting confusing charges based on fact descriptions. While existing charge prediction methods have shown impressive performance, they face significant challenges when dealing with confusing charges, such as Snatch and Robbery. In the legal domain, constituent elements play a pivotal role in distinguishing confusing charges. Constituent elements are fundamental behaviors underlying criminal punishment and have subtle distinctions among charges. In this paper, we introduce a novel From Graph to Word Bag (FWGB) approach, which introduces domain knowledge regarding constituent elements to guide the model in making judgments on confusing charges, much like a judge's reasoning process. Specifically, we first construct a legal knowledge graph containing constituent elements to help select keywords for each charge, forming a word bag. Subsequently, to guide the model's attention towards the differentiating information for each charge within the context, we expand the attention mechanism and introduce a new loss function with attention supervision through words in the word bag. We construct the confusing charges dataset from real-world judicial documents. Experiments demonstrate the effectiveness of our method, especially in maintaining exceptional performance in imbalanced label distributions.
- [109] arXiv:2403.04449 [ pdf , ps , html , other ]
-
Title: Feedback-Generation for Programming Exercises With GPT-4Comments: accepted at ITiCSE 2024, Milan, ItalySubjects: Artificial Intelligence (cs.AI)
Abstract: Ever since Large Language Models (LLMs) and related applications have become broadly available, several studies investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT 4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if provided timely and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements. For example, the output is more structured and consistent. GPT-4 Turbo can also accurately identify invalid casing in student programs' output. In some cases, the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted such as stating that the submission is correct but an error needs to be fixed. The present work increases our understanding of LLMs' potential, limitations, and how to integrate them into e-assessment systems, pedagogical scenarios, and instructing students who are using applications based on GPT-4.
- [110] arXiv:2403.04471 [ pdf , ps , other ]
-
Title: The Shutdown Problem: An AI Engineering Puzzle for Decision TheoristsSubjects: Artificial Intelligence (cs.AI)
Abstract: I explain the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don't try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems show that agents satisfying some innocuous-seeming conditions will often try to prevent or cause the pressing of the shutdown button, even in cases where it's costly to do so. And patience trades off against shutdownability: the more patient an agent, the greater the costs that agent is willing to incur to manipulate the shutdown button. I end by noting that these theorems can guide our search for solutions.
- [111] arXiv:2403.04483 [ pdf , ps , html , other ]
-
Title: GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning CapabilityComments: 9 pagesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Evaluating and enhancing the general capabilities of large language models (LLMs) has been an important research topic. Graph is a common data structure in the real world, and understanding graph data is a crucial part for advancing general intelligence. To evaluate and enhance the graph understanding abilities of LLMs, in this paper, we propose a benchmark named GraphInstruct, which comprehensively includes 21 classical graph reasoning tasks, providing diverse graph generation pipelines and detailed reasoning steps. Based on GraphInstruct, we further construct GraphLM through efficient instruction-tuning, which shows prominent graph understanding capability. In order to enhance the LLM with graph reasoning capability as well, we propose a step mask training strategy, and construct a model named GraphLM+. As one of the pioneering efforts to enhance the graph understanding and reasoning abilities of LLMs, extensive experiments have demonstrated the superiority of GraphLM and GraphLM+ over other LLMs. We look forward to more researchers exploring the potential of LLMs in the graph data mining domain through GraphInstruct. Our code for generating GraphInstruct is released publicly at: this https URL .
- [112] arXiv:2403.04504 [ pdf , ps , html , other ]
-
Title: Improving Matrix Completion by Exploiting Rating Ordinality in Graph Neural NetworksComments: 4 pages, 2 figures, 3 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: Matrix completion is an important area of research in recommender systems. Recent methods view a rating matrix as a user-item bi-partite graph with labeled edges denoting observed ratings and predict the edges between the user and item nodes by using the graph neural network (GNN). Despite their effectiveness, they treat each rating type as an independent relation type and thus cannot sufficiently consider the ordinal nature of the ratings. In this paper, we explore a new approach to exploit rating ordinality for GNN, which has not been studied well in the literature. We introduce a new method, called ROGMC, to leverage Rating Ordinality in GNN-based Matrix Completion. It uses cumulative preference propagation to directly incorporate rating ordinality in GNN's message passing, allowing for users' stronger preferences to be more emphasized based on inherent orders of rating types. This process is complemented by interest regularization which facilitates preference learning using the underlying interest information. Our extensive experiments show that ROGMC consistently outperforms the existing strategies of using rating types for GNN. We expect that our attempt to explore the feasibility of utilizing rating ordinality for GNN may stimulate further research in this direction.
- [113] arXiv:2403.04511 [ pdf , ps , html , other ]
-
Title: Uncovering the Deep Filter Bubble: Narrow Exposure in Short-Video RecommendationComments: accepted to WWW 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Filter bubbles have been studied extensively within the context of online content platforms due to their potential to cause undesirable outcomes such as user dissatisfaction or polarization. With the rise of short-video platforms, the filter bubble has been given extra attention because these platforms rely on an unprecedented use of the recommender system to provide relevant content. In our work, we investigate the deep filter bubble, which refers to the user being exposed to narrow content within their broad interests. We accomplish this using one-year interaction data from a top short-video platform in China, which includes hierarchical data with three levels of categories for each video. We formalize our definition of a "deep" filter bubble within this context, and then explore various correlations within the data: first understanding the evolution of the deep filter bubble over time, and later revealing some of the factors that give rise to this phenomenon, such as specific categories, user demographics, and feedback type. We observe that while the overall proportion of users in a filter bubble remains largely constant over time, the depth composition of their filter bubble changes. In addition, we find that some demographic groups that have a higher likelihood of seeing narrower content and implicit feedback signals can lead to less bubble formation. Finally, we propose some ways in which recommender systems can be designed to reduce the risk of a user getting caught in a bubble.
- [114] arXiv:2403.04541 [ pdf , ps , html , other ]
-
Title: Towards Automatic Composition of ASP Programs from Natural Language SpecificationsSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper moves the first step towards automating the composition of Answer Set Programming (ASP) specifications. In particular, the following contributions are provided: (i) A dataset focused on graph-related problem specifications, designed to develop and assess tools for ASP automatic coding; (ii) A two-step architecture, implemented in the NL2ASP tool, for generating ASP programs from natural language specifications. NL2ASP uses neural machine translation to transform natural language into Controlled Natural Language (CNL) statements. Subsequently, CNL statements are converted into ASP code using the CNL2ASP tool. An experiment confirms the viability of the approach.
- [115] arXiv:2403.04571 [ pdf , ps , html , other ]
-
Title: Machine learning and information theory concepts towards an AI MathematicianComments: To appear in the Bulletin of the AMS, 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: The current state-of-the-art in artificial intelligence is impressive, especially in terms of mastery of language, but not so much in terms of mathematical reasoning. What could be missing? Can we learn something useful about that gap from how the brains of mathematicians go about their craft? This essay builds on the idea that current deep learning mostly succeeds at system 1 abilities -- which correspond to our intuition and habitual behaviors -- but still lacks something important regarding system 2 abilities -- which include reasoning and robust uncertainty estimation. It takes an information-theoretical posture to ask questions about what constitutes an interesting mathematical statement, which could guide future work in crafting an AI mathematician. The focus is not on proving a given theorem but on discovering new and interesting conjectures. The central hypothesis is that a desirable body of theorems better summarizes the set of all provable statements, for example by having a small description length while at the same time being close (in terms of number of derivation steps) to many provable statements.
- [116] arXiv:2403.04577 [ pdf , ps , html , other ]
-
Title: Wiki-TabNER:Advancing Table Interpretation Through Named Entity RecognitionSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Web tables contain a large amount of valuable knowledge and have inspired tabular language models aimed at tackling table interpretation (TI) tasks. In this paper, we analyse a widely used benchmark dataset for evaluation of TI tasks, particularly focusing on the entity linking task. Our analysis reveals that this dataset is overly simplified, potentially reducing its effectiveness for thorough evaluation and failing to accurately represent tables as they appear in the real-world. To overcome this drawback, we construct and annotate a new more challenging dataset. In addition to introducing the new dataset, we also introduce a novel problem aimed at addressing the entity linking task: named entity recognition within cells. Finally, we propose a prompting framework for evaluating the newly developed large language models (LLMs) on this novel TI task. We conduct experiments on prompting LLMs under various settings, where we use both random and similarity-based selection to choose the examples presented to the models. Our ablation study helps us gain insights into the impact of the few-shot examples. Additionally, we perform qualitative analysis to gain insights into the challenges encountered by the models and to understand the limitations of the proposed dataset.
- [117] arXiv:2403.04588 [ pdf , ps , html , other ]
-
Title: Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global WorkspaceComments: Under review in a conferenceSubjects: Artificial Intelligence (cs.AI)
Abstract: Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors is difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train a RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice-versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
- [118] arXiv:2403.04667 [ pdf , ps , html , other ]
-
Title: The Social Impact of Generative AI: An Analysis on ChatGPTMaria T. Baldassarre , Danilo Caivano , Berenice Fernandez Nieto , Domenico Gigante , Azzurra RagoneComments: Presented at GoodIT2023 - ACM Conference on Information Technology for Social GoodJournal-ref: Proceedings of the 2023 ACM Conference on Information Technology for Social Good (GoodIT '23)Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Abstract: In recent months, the social impact of Artificial Intelligence (AI) has gained considerable public interest, driven by the emergence of Generative AI models, ChatGPT in particular. The rapid development of these models has sparked heated discussions regarding their benefits, limitations, and associated risks. Generative models hold immense promise across multiple domains, such as healthcare, finance, and education, to cite a few, presenting diverse practical applications. Nevertheless, concerns about potential adverse effects have elicited divergent perspectives, ranging from privacy risks to escalating social inequality. This paper adopts a methodology to delve into the societal implications of Generative AI tools, focusing primarily on the case of ChatGPT. It evaluates the potential impact on several social sectors and illustrates the findings of a comprehensive literature review of both positive and negative effects, emerging trends, and areas of opportunity of Generative AI models. This analysis aims to facilitate an in-depth discussion by providing insights that can inspire policy, regulation, and responsible development practices to foster a human-centered AI.
- [119] arXiv:2403.04732 [ pdf , ps , html , other ]
-
Title: How Far Are We from Intelligent Visual Deductive Reasoning?Comments: ICLR 2024 AGI workshop. this https URLSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible strides on diverse vision language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs), to assess VLMs' abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs employing standard strategies such as in-context learning, self-consistency, and Chain-of-thoughts (CoT) on three diverse datasets, including the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We found that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks. Moreover, a detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.
- [120] arXiv:2403.04772 [ pdf , ps , html , other ]
-
Title: Representing Pedagogic Content Knowledge Through Rough SetsComments: 15+ pagesSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: A teacher's knowledge base consists of knowledge of mathematics content, knowledge of student epistemology, and pedagogical knowledge. It has severe implications on the understanding of student's knowledge of content, and the learning context in general. The necessity to formalize the different content knowledge in approximate senses is recognized in the education research literature. A related problem is that of coherent formalizability. Existing responsive or smart AI-based software systems do not concern themselves with meaning, and trained ones are replete with their own issues. In the present research, many issues in modeling teachers' understanding of content are identified, and a two-tier rough set-based model is proposed by the present author for the purpose of developing software that can aid the varied tasks of a teacher. The main advantage of the proposed approach is in its ability to coherently handle vagueness, granularity and multi-modality. An extended example to equational reasoning is used to demonstrate these. The paper is meant for rough set researchers intending to build logical models or develop meaning-aware AI-software to aid teachers, and education research experts.
- [121] arXiv:2403.04859 [ pdf , ps , other ]
-
Title: Self-Supervision in Time for Satellite Images(S3-TSS): A novel method of SSL technique in Satellite imagesSubjects: Artificial Intelligence (cs.AI)
Abstract: With the limited availability of labeled data with various atmospheric conditions in remote sensing images, it seems useful to work with self-supervised algorithms. Few pretext-based algorithms, including from rotation, spatial context and jigsaw puzzles are not appropriate for satellite images. Often, satellite images have a higher temporal frequency. So, the temporal dimension of remote sensing data provides natural augmentation without requiring us to create artificial augmentation of images. Here, we propose S3-TSS, a novel method of self-supervised learning technique that leverages natural augmentation occurring in temporal dimension. We compare our results with current state-of-the-art methods and also perform various experiments. We observed that our method was able to perform better than baseline SeCo in four downstream datasets. Code for our work can be found here: this https URL
- [122] arXiv:2403.04866 [ pdf , ps , html , other ]
-
Title: A Modular End-to-End Multimodal Learning Method for Structured and Unstructured DataComments: 8 pages, 1 figureSubjects: Artificial Intelligence (cs.AI)
Abstract: Multimodal learning is a rapidly growing research field that has revolutionized multitasking and generative modeling in AI. While much of the research has focused on dealing with unstructured data (e.g., language, images, audio, or video), structured data (e.g., tabular data, time series, or signals) has received less attention. However, many industry-relevant use cases involve or can be benefited from both types of data. In this work, we propose a modular, end-to-end multimodal learning method called MAGNUM, which can natively handle both structured and unstructured data. MAGNUM is flexible enough to employ any specialized unimodal module to extract, compress, and fuse information from all available modalities.
- [123] arXiv:2403.04893 [ pdf , ps , html , other ]
-
Title: A Safe Harbor for AI Evaluation and Red TeamingShayne Longpre , Sayash Kapoor , Kevin Klyman , Ashwin Ramaswami , Rishi Bommasani , Borhane Blili-Hamelin , Yangsibo Huang , Aviya Skowron , Zheng-Xin Yong , Suhas Kotha , Yi Zeng , Weiyan Shi , Xianjun Yang , Reid Southen , Alexander Robey , Patrick Chao , Diyi Yang , Ruoxi Jia , Daniel Kang , Sandy Pentland , Arvind Narayanan , Percy Liang , Peter HendersonSubjects: Artificial Intelligence (cs.AI)
Abstract: Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems. However, the terms of service and enforcement strategies used by prominent AI companies to deter model misuse have disincentives on good faith safety evaluations. This causes some researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal. Although some companies offer researcher access programs, they are an inadequate substitute for independent research access, as they have limited community representation, receive inadequate funding, and lack independence from corporate incentives. We propose that major AI developers commit to providing a legal and technical safe harbor, indemnifying public interest safety research and protecting it from the threat of account suspensions or legal reprisal. These proposals emerged from our collective experience conducting safety, privacy, and trustworthiness research on generative AI systems, where norms and incentives could be better aligned with public interests, without exacerbating model misuse. We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.
- [124] arXiv:2403.04919 [ pdf , ps , html , other ]
-
Title: Identifying Causal Effects Under Functional DependenciesSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Symbolic Computation (cs.SC); Methodology (stat.ME)
Abstract: We study the identification of causal effects, motivated by two improvements to identifiability which can be attained if one knows that some variables in a causal graph are functionally determined by their parents (without needing to know the specific functions). First, an unidentifiable causal effect may become identifiable when certain variables are functional. Second, certain functional variables can be excluded from being observed without affecting the identifiability of a causal effect, which may significantly reduce the number of needed variables in observational data. Our results are largely based on an elimination procedure which removes functional variables from a causal graph while preserving key properties in the resulting causal graph, including the identifiability of causal effects.
- [125] arXiv:2403.04931 [ pdf , ps , html , other ]
-
Title: A Survey on Human-AI Teaming with Large Pre-Trained ModelsVanshika Vats , Marzia Binta Nizam , Minghao Liu , Ziyuan Wang , Richard Ho , Mohnish Sai Prasad , Vincent Titterton , Sai Venkat Malreddy , Riya Aggarwal , Yanwen Xu , Lei Ding , Jay Mehta , Nathan Grinnell , Li Liu , Sijia Zhong , Devanathan Nallur Gandamani , Xinyi Tang , Rohan Ghosalkar , Celeste Shen , Rachel Shen , Nafisa Hussain , Kesav Ravichandran , James DavisSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: In the rapidly evolving landscape of artificial intelligence (AI), the collaboration between human intelligence and AI systems, known as Human-AI (HAI) Teaming, has emerged as a cornerstone for advancing problem-solving and decision-making processes. The advent of Large Pre-trained Models (LPtM) has significantly transformed this landscape, offering unprecedented capabilities by leveraging vast amounts of data to understand and predict complex patterns. This paper surveys the pivotal integration of LPtMs with HAI, emphasizing how these models enhance collaborative intelligence beyond traditional approaches. It examines the synergistic potential of LPtMs in augmenting human capabilities, discussing this collaboration for AI model improvements, effective teaming, ethical considerations, and their broad applied implications in various sectors. Through this exploration, the study sheds light on the transformative impact of LPtM-enhanced HAI Teaming, providing insights for future research, policy development, and strategic implementations aimed at harnessing the full potential of this collaboration for research and societal benefit.
- [126] arXiv:2403.04957 [ pdf , ps , html , other ]
-
Title: Automatic and Universal Prompt Injection Attacks against Large Language ModelsComments: Pre-print, code is available at this https URLSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. However, their capabilities can be exploited through prompt injection attacks. These attacks manipulate LLM-integrated applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. The substantial risks posed by these attacks underscore the need for a thorough understanding of the threats. Yet, research in this area faces challenges due to the lack of a unified goal for such attacks and their reliance on manually crafted prompts, complicating comprehensive assessments of prompt injection robustness. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the face of defensive measures. With only five training samples (0.3% relative to the test data), our attack can achieve superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which can avoid overestimation of robustness, especially for defense mechanisms.
- [127] arXiv:2403.04964 [ pdf , ps , other ]
-
Title: Tell me the truth: A system to measure the trustworthiness of Large Language ModelsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: Large Language Models (LLM) have taken the front seat in most of the news since November 2022, when ChatGPT was introduced. After more than one year, one of the major reasons companies are resistant to adopting them is the limited confidence they have in the trustworthiness of those systems. In a study by (Baymard, 2023), ChatGPT-4 showed an 80.1% false-positive error rate in identifying usability issues on websites. A Jan. '24 study by JAMA Pediatrics found that ChatGPT has an accuracy rate of 17% percent when diagnosing pediatric medical cases (Barile et al., 2024). But then, what is "trust"? Trust is a relative, subject condition that can change based on culture, domain, individuals. And then, given a domain, how can the trustworthiness of a system be measured? In this paper, I present a systematic approach to measure trustworthiness based on a predefined ground truth, represented as a knowledge graph of the domain. The approach is a process with humans in the loop to validate the representation of the domain and to fine-tune the system.
Measuring the trustworthiness would be essential for all the entities operating in critical environments, such as healthcare, defense, finance, but it would be very relevant for all the users of LLMs. - [128] arXiv:2403.05000 [ pdf , ps , other ]
-
Title: Medical Speech Symptoms Classification via Disentangled RepresentationComments: Accepted by the 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2024)Subjects: Artificial Intelligence (cs.AI)
Abstract: Intent is defined for understanding spoken language in existing works. Both textual features and acoustic features involved in medical speech contain intent, which is important for symptomatic diagnosis. In this paper, we propose a medical speech classification model named DRSC that automatically learns to disentangle intent and content representations from textual-acoustic data for classification. The intent representations of the text domain and the Mel-spectrogram domain are extracted via intent encoders, and then the reconstructed text feature and the Mel-spectrogram feature are obtained through two exchanges. After combining the intent from two domains into a joint representation, the integrated intent representation is fed into a decision layer for classification. Experimental results show that our model obtains an average accuracy rate of 95% in detecting 25 different medical symptoms.
- [129] arXiv:2403.05025 [ pdf , ps , html , other ]
-
Title: Towards Multimodal Human Intention Understanding Debiasing via Subject-DeconfoundingComments: 14 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: Multimodal intention understanding (MIU) is an indispensable component of human expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviors. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem due to data distribution discrepancies among subjects. Concretely, MIU models are easily misled by distinct subjects with different expression customs and characteristics in the training data to learn subject-specific spurious correlations, significantly limiting performance and generalizability across uninitiated subjects.Motivated by this observation, we introduce a recapitulative causal graph to formulate the MIU procedure and analyze the confounding effect of subjects. Then, we propose SuCI, a simple yet effective causal intervention module to disentangle the impact of subjects acting as unobserved confounders and achieve model training via true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MIU benchmarks clearly demonstrate the effectiveness of the proposed module.
- [130] arXiv:2403.05029 [ pdf , ps , html , other ]
-
Title: BjTT: A Large-scale Multimodal Dataset for Traffic PredictionChengyang Zhang , Yong Zhang , Qitan Shao , Jiangtao Feng , Bo Li , Yisheng Lv , Xinglin Piao , Baocai YinSubjects: Artificial Intelligence (cs.AI)
Abstract: Traffic prediction is one of the most significant foundations in Intelligent Transportation Systems (ITS). Traditional traffic prediction methods rely only on historical traffic data to predict traffic trends and face two main challenges. 1) insensitivity to unusual events. 2) limited performance in long-term prediction. In this work, we explore how generative models combined with text describing the traffic system can be applied for traffic generation, and name the task Text-to-Traffic Generation (TTG). The key challenge of the TTG task is how to associate text with the spatial structure of the road network and traffic data for generating traffic situations. To this end, we propose ChatTraffic, the first diffusion model for text-to-traffic generation. To guarantee the consistency between synthetic and real data, we augment a diffusion model with the Graph Convolutional Network (GCN) to extract spatial correlations of traffic data. In addition, we construct a large dataset containing text-traffic pairs for the TTG task. We benchmarked our model qualitatively and quantitatively on the released dataset. The experimental results indicate that ChatTraffic can generate realistic traffic situations from the text. Our code and dataset are available at this https URL .
- [131] arXiv:2403.05112 [ pdf , ps , html , other ]
-
Title: RLPeri: Accelerating Visual Perimetry Test with Reinforcement Learning and Convolutional Feature ExtractionComments: Published at AAAI-24Journal-ref: The 38th Annual AAAI Conference on Artificial Intelligence, 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Visual perimetry is an important eye examination that helps detect vision problems caused by ocular or neurological conditions. During the test, a patient's gaze is fixed at a specific location while light stimuli of varying intensities are presented in central and peripheral vision. Based on the patient's responses to the stimuli, the visual field mapping and sensitivity are determined. However, maintaining high levels of concentration throughout the test can be challenging for patients, leading to increased examination times and decreased accuracy.
In this work, we present RLPeri, a reinforcement learning-based approach to optimize visual perimetry testing. By determining the optimal sequence of locations and initial stimulus values, we aim to reduce the examination time without compromising accuracy. Additionally, we incorporate reward shaping techniques to further improve the testing performance. To monitor the patient's responses over time during testing, we represent the test's state as a pair of 3D matrices. We apply two different convolutional kernels to extract spatial features across locations as well as features across different stimulus values for each location. Through experiments, we demonstrate that our approach results in a 10-20% reduction in examination time while maintaining the accuracy as compared to state-of-the-art methods. With the presented approach, we aim to make visual perimetry testing more efficient and patient-friendly, while still providing accurate results. - [132] arXiv:2403.05130 [ pdf , ps , html , other ]
-
Title: From Chain to Tree: Refining Chain-like Rules into Tree-like Rules on Knowledge GraphsSubjects: Artificial Intelligence (cs.AI)
Abstract: With good explanatory power and controllability, rule-based methods play an important role in many tasks such as knowledge reasoning and decision support. However, existing studies primarily focused on learning chain-like rules, which limit their semantic expressions and accurate prediction abilities. As a result, chain-like rules usually fire on the incorrect grounding values, producing inaccurate or even erroneous reasoning results. In this paper, we propose the concept of tree-like rules on knowledge graphs to expand the application scope and improve the reasoning ability of rule-based methods. Meanwhile, we propose an effective framework for refining chain-like rules into tree-like rules. Experimental comparisons on four public datasets show that the proposed framework can easily adapt to other chain-like rule induction methods and the refined tree-like rules consistently achieve better performances than chain-like rules on link prediction. The data and code of this paper can be available at https://anonymous.4open.science/r/tree-rule-E3CD/.
- [133] arXiv:2403.05131 [ pdf , ps , other ]
-
Title: Sora as an AGI World Model? A Complete Survey on Text-to-Video GenerationJoseph Cho , Fachrina Dewi Puspitasari , Sheng Zheng , Jingyao Zheng , Lik-Hang Lee , Tae-Ho Kim , Choong Seon Hong , Chaoning ZhangComments: First complete survey on Text-to-Video Generation, 36 pages, 16 figuresSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing. This survey critically examines the progression of text-to-video technologies, focusing on the shift from traditional generative models to the cutting-edge Sora model, highlighting developments in scalability and generalizability. Distinguishing our analysis from prior works, we offer an in-depth exploration of the technological frameworks and evolutionary pathways of these models. Additionally, we delve into practical applications and address ethical and technological challenges such as the inability to perform multiple entity handling, comprehend causal-effect learning, understand physical interaction, perceive object scaling and proportioning, and combat object hallucination which is also a long-standing problem in generative models. Our comprehensive discussion covers the topic of enablement of text-to-video generation models as human-assistive tools and world models, as well as eliciting model's shortcomings and summarizing future improvement direction that mainly centers around training datasets and evaluation metrics (both automatic and human-centered). Aimed at both newcomers and seasoned researchers, this survey seeks to catalyze further innovation and discussion in the growing field of text-to-video generation, paving the way for more reliable and practical generative artificial intelligence technologies.
- [134] arXiv:2403.05229 [ pdf , ps , other ]
-
Title: Developing Federated Time-to-Event Scores Using Heterogeneous Real-World Survival DataSiqi Li , Yuqing Shang , Ziwen Wang , Qiming Wu , Chuan Hong , Yilin Ning , Di Miao , Marcus Eng Hock Ong , Bibhas Chakraborty , Nan LiuSubjects: Artificial Intelligence (cs.AI)
Abstract: Survival analysis serves as a fundamental component in numerous healthcare applications, where the determination of the time to specific events (such as the onset of a certain disease or death) for patients is crucial for clinical decision-making. Scoring systems are widely used for swift and efficient risk prediction. However, existing methods for constructing survival scores presume that data originates from a single source, posing privacy challenges in collaborations with multiple data owners. We propose a novel framework for building federated scoring systems for multi-site survival outcomes, ensuring both privacy and communication efficiency. We applied our approach to sites with heterogeneous survival data originating from emergency departments in Singapore and the United States. Additionally, we independently developed local scores at each site. In testing datasets from each participant site, our proposed federated scoring system consistently outperformed all local models, evidenced by higher integrated area under the receiver operating characteristic curve (iAUC) values, with a maximum improvement of 11.6%. Additionally, the federated score's time-dependent AUC(t) values showed advantages over local scores, exhibiting narrower confidence intervals (CIs) across most time points. The model developed through our proposed method exhibits effective performance on each local site, signifying noteworthy implications for healthcare research. Sites participating in our proposed federated scoring model training gained benefits by acquiring survival models with enhanced prediction accuracy and efficiency. This study demonstrates the effectiveness of our privacy-preserving federated survival score generation framework and its applicability to real-world heterogeneous survival data.
- [135] arXiv:2403.05260 [ pdf , ps , html , other ]
-
Title: Predicting Single-cell Drug Sensitivity by Adaptive Weighted Feature for Adversarial Multi-source Domain AdaptationSubjects: Artificial Intelligence (cs.AI)
Abstract: The development of single-cell sequencing technology had promoted the generation of a large amount of single-cell transcriptional profiles, providing valuable opportunities to explore drug-resistant cell subpopulations in a tumor. However, the drug sensitivity data in single-cell level is still scarce to date, pressing an urgent and highly challenging task for computational prediction of the drug sensitivity to individual cells. This paper proposed scAdaDrug, a multi-source adaptive weighting model to predict single-cell drug sensitivity. We used an autoencoder to extract domain-invariant features related to drug sensitivity from multiple source domains by exploiting adversarial domain adaptation. Especially, we introduced an adaptive weight generator to produce importance-aware and mutual independent weights, which could adaptively modulate the embedding of each sample in dimension-level for both source and target domains. Extensive experimental results showed that our model achieved state-of-the-art performance in predicting drug sensitivity on sinle-cell datasets, as well as on cell line and patient datasets.
- [136] arXiv:2403.05265 [ pdf , ps , html , other ]
-
Title: MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-ExpertsSubjects: Artificial Intelligence (cs.AI)
Abstract: Online movie review websites are valuable for information and discussion about movies. However, the massive spoiler reviews detract from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews' text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user's information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta feature from the user-movie network, the review's textual content, and the review's metadata respectively. To handle genre-specific spoilers, we then adopt Mixture-of-Experts architecture to process information in three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE's superiority in robustness and generalization.
- [137] arXiv:2403.05307 [ pdf , ps , html , other ]
-
Title: Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis AgentsJinyang Li , Nan Huo , Yan Gao , Jiayi Shi , Yingxiu Zhao , Ge Qu , Yurong Wu , Chenhao Ma , Jian-Guang Lou , Reynold ChengComments: 30 pages, 7 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Interactive Data Analysis, the collaboration between humans and LLM agents, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic interactive logs for data analysis hinder the quantitative evaluation of Large Language Model (LLM) agents in this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark to evaluate LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed by an economical multi-agent environment, Decision Company, with few human efforts. We evaluate popular and advanced LLM agents in Tapilot-Crossing, which underscores the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that Air can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.5%.
- [138] arXiv:2403.05318 [ pdf , ps , html , other ]
-
Title: Looking Ahead to Avoid Being Late: Solving Hard-Constrained Traveling Salesman ProblemSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Many real-world problems can be formulated as a constrained Traveling Salesman Problem (TSP). However, the constraints are always complex and numerous, making the TSPs challenging to solve. When the number of complicated constraints grows, it is time-consuming for traditional heuristic algorithms to avoid illegitimate outcomes. Learning-based methods provide an alternative to solve TSPs in a soft manner, which also supports GPU acceleration to generate solutions quickly. Nevertheless, the soft manner inevitably results in difficulty solving hard-constrained problems with learning algorithms, and the conflicts between legality and optimality may substantially affect the optimality of the solution. To overcome this problem and to have an effective solution against hard constraints, we proposed a novel learning-based method that uses looking-ahead information as the feature to improve the legality of TSP with Time Windows (TSPTW) solutions. Besides, we constructed TSPTW datasets with hard constraints in order to accurately evaluate and benchmark the statistical performance of various approaches, which can serve the community for future research. With comprehensive experiments on diverse datasets, MUSLA outperforms existing baselines and shows generalizability potential.
- [139] arXiv:2403.05407 [ pdf , ps , html , other ]
-
Title: Algorithmic Identification of Essential Exogenous Nodes for Causal Sufficiency in Brain NetworksSubjects: Artificial Intelligence (cs.AI)
Abstract: In the investigation of any causal mechanisms, such as the brain's causal networks, the assumption of causal sufficiency plays a critical role. Notably, neglecting this assumption can result in significant errors, a fact that is often disregarded in the causal analysis of brain networks. In this study, we propose an algorithmic identification approach for determining essential exogenous nodes that satisfy the critical need for causal sufficiency to adhere to it in such inquiries. Our approach consists of three main steps: First, by capturing the essence of the Peter-Clark (PC) algorithm, we conduct independence tests for pairs of regions within a network, as well as for the same pairs conditioned on nodes from other networks. Next, we distinguish candidate confounders by analyzing the differences between the conditional and unconditional results, using the Kolmogorov-Smirnov test. Subsequently, we utilize Non-Factorized identifiable Variational Autoencoders (NF-iVAE) along with the Correlation Coefficient index (CCI) metric to identify the confounding variables within these candidate nodes. Applying our method to the Human Connectome Projects (HCP) movie-watching task data, we demonstrate that while interactions exist between dorsal and ventral regions, only dorsal regions serve as confounders for the visual networks, and vice versa. These findings align consistently with those resulting from the neuroscientific perspective. Finally, we show the reliability of our results by testing 30 independent runs for NF-iVAE initialization.
- [140] arXiv:2403.05525 [ pdf , ps , html , other ]
-
Title: DeepSeek-VL: Towards Real-World Vision-Language UnderstandingHaoyu Lu , Wen Liu , Bo Zhang , Bingxuan Wang , Kai Dong , Bo Liu , Jingxiang Sun , Tongzheng Ren , Zhuoshu Li , Hao Yang , Yaofeng Sun , Chengqi Deng , Hanwei Xu , Zhenda Xie , Chong RuanComments: this https URLSubjects: Artificial Intelligence (cs.AI)
Abstract: We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:
We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities.
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model. - [141] arXiv:2403.05632 [ pdf , ps , html , other ]
-
Title: Can Large Language Models Play Games? A Case Study of A Self-Play ApproachSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge. While LLMs have proven beneficial as decision-making aids, their reliability is hampered by limitations in reasoning, hallucination phenomenon, and so on. On the other hand, Monte-Carlo Tree Search (MCTS) is a heuristic search algorithm that provides reliable decision-making solutions, achieved through recursive rollouts and self-play. However, the effectiveness of MCTS relies heavily on heuristic pruning and external value functions, particularly in complex decision scenarios. This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve deterministic turn-based zero-sum games (DTZG), such as chess and go, without the need for additional training. Specifically, we utilize LLMs as both action pruners and proxies for value functions without the need for additional training. We theoretically prove that the suboptimality of the estimated value in our proposed method scales with $\tilde{\mathcal O}\Bigl(\frac{|\tilde {\mathcal A}|}{\sqrt{N}} + \epsilon_\mathrm{pruner} + \epsilon_\mathrm{critic}\Bigr)$, where \(N\) is the number of simulations, $|\tilde {\mathcal A}|$ is the cardinality of the pruned action space by LLM, and $\epsilon_\mathrm{pruner}$ and $\epsilon_\mathrm{critic}$ quantify the errors incurred by adopting LLMs as action space pruner and value function proxy, respectively. Our experiments in chess and go demonstrate the capability of our method to address challenges beyond the scope of MCTS and improve the performance of the directly application of LLMs.
- [142] arXiv:2403.05636 [ pdf , ps , html , other ]
-
Title: Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive ApproachSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks through few-shot or zero-shot prompting, bypassing the need for parameter tuning. While convenient, this modus operandi aggravates ``hallucination'' concerns, particularly given the enigmatic ``black-box'' nature behind their gigantic model sizes. Such concerns are exacerbated in high-stakes applications (e.g., healthcare), where unaccountable decision errors can lead to devastating consequences. In contrast, human decision-making relies on nuanced cognitive processes, such as the ability to sense and adaptively correct misjudgments through conceptual understanding. Drawing inspiration from human cognition, we propose an innovative \textit{metacognitive} approach, dubbed \textbf{CLEAR}, to equip LLMs with capabilities for self-aware error identification and correction. Our framework facilitates the construction of concept-specific sparse subnetworks that illuminate transparent decision pathways. This provides a novel interface for model \textit{intervention} after deployment. Our intervention offers compelling advantages: (\textit{i})~at deployment or inference time, our metacognitive LLMs can self-consciously identify potential mispredictions with minimum human involvement, (\textit{ii})~the model has the capability to self-correct its errors efficiently, obviating the need for additional tuning, and (\textit{iii})~the rectification procedure is not only self-explanatory but also user-friendly, enhancing the interpretability and accessibility of the model. By integrating these metacognitive features, our approach pioneers a new path toward engendering greater trustworthiness and accountability in the deployment of LLMs.
- [143] arXiv:2403.05641 [ pdf , ps , other ]
-
Title: A Feature-based Generalizable Prediction Model for Both Perceptual and Abstract ReasoningSubjects: Artificial Intelligence (cs.AI) ; Neurons and Cognition (q-bio.NC)
Abstract: A hallmark of human intelligence is the ability to infer abstract rules from limited experience and apply these rules to unfamiliar situations. This capacity is widely studied in the visual domain using the Raven's Progressive Matrices. Recent advances in deep learning have led to multiple artificial neural network models matching or even surpassing human performance. However, while humans can identify and express the rule underlying these tasks with little to no exposure, contemporary neural networks often rely on massive pattern-based training and cannot express or extrapolate the rule inferred from the task. Furthermore, most Raven's Progressive Matrices or Raven-like tasks used for neural network training used symbolic representations, whereas humans can flexibly switch between symbolic and continuous perceptual representations. In this work, we present an algorithmic approach to rule detection and application using feature detection, affine transformation estimation and search. We applied our model to a simplified Raven's Progressive Matrices task, previously designed for behavioral testing and neuroimaging in humans. The model exhibited one-shot learning and achieved near human-level performance in the symbolic reasoning condition of the simplified task. Furthermore, the model can express the relationships discovered and generate multi-step predictions in accordance with the underlying rule. Finally, the model can reason using continuous patterns. We discuss our results and their relevance to studying abstract reasoning in humans, as well as their implications for improving intelligent machines.
- [144] arXiv:2403.05680 [ pdf , ps , html , other ]
-
Title: Decomposing Vision-based LLM Predictions for Auto-Evaluation with GPT-4Qingqing Zhu , Benjamin Hou , Tejas S. Mathai , Pritam Mukherjee , Qiao Jin , Xiuying Chen , Zhizheng Wang , Ruida Cheng , Ronald M. Summers , Zhiyong LuSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The volume of CT exams being done in the world has been rising every year, which has led to radiologist burn-out. Large Language Models (LLMs) have the potential to reduce their burden, but their adoption in the clinic depends on radiologist trust, and easy evaluation of generated content. Presently, many automated methods are available to evaluate the reports generated for chest radiographs, but such an approach is not available for CT presently. In this paper, we propose a novel evaluation framework to judge the capabilities of vision-language LLMs in generating accurate summaries of CT-based abnormalities. CT slices containing an abnormality (e.g., lesion) were input to a vision-based LLM (GPT-4V, LLaVA-Med, and RadFM), and it generated a free-text summary of the predicted characteristics of the abnormality. Next, a GPT-4 model decomposed the summary into specific aspects (body part, location, type, and attributes), automatically evaluated the characteristics against the ground-truth, and generated a score for each aspect based on its clinical relevance and factual accuracy. These scores were then contrasted against those obtained from a clinician, and a high correlation ( 85%, p < .001) was observed. Although GPT-4V outperformed other models in our evaluation, it still requires overall improvement. Our evaluation method offers valuable insights into the specific areas that need the most enhancement, guiding future development in this field.
- [145] arXiv:2403.05683 [ pdf , ps , html , other ]
-
Title: Efficient Public Health Intervention Planning Using Decomposition-Based Decision-Focused LearningComments: 12 pages, 3 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: The declining participation of beneficiaries over time is a key concern in public health programs. A popular strategy for improving retention is to have health workers `intervene' on beneficiaries at risk of dropping out. However, the availability and time of these health workers are limited resources. As a result, there has been a line of research on optimizing these limited intervention resources using Restless Multi-Armed Bandits (RMABs). The key technical barrier to using this framework in practice lies in the need to estimate the beneficiaries' RMAB parameters from historical data. Recent research has shown that Decision-Focused Learning (DFL), which focuses on maximizing the beneficiaries' adherence rather than predictive accuracy, improves the performance of intervention targeting using RMABs. Unfortunately, these gains come at a high computational cost because of the need to solve and evaluate the RMAB in each DFL training step. In this paper, we provide a principled way to exploit the structure of RMABs to speed up intervention planning by cleverly decoupling the planning for different beneficiaries. We use real-world data from an Indian NGO, ARMMAN, to show that our approach is up to two orders of magnitude faster than the state-of-the-art approach while also yielding superior model performance. This would enable the NGO to scale up deployments using DFL to potentially millions of mothers, ultimately advancing progress toward UNSDG 3.1.
- [146] arXiv:2403.05732 [ pdf , ps , html , other ]
-
Title: Conservative DDPG -- Pessimistic RL without EnsembleSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and incorporating a behavioral cloning (BC) loss penalty. This solution, acting as an uncertainty measure, can be easily implemented with minimal code and without the need for an ensemble. Our empirical findings strongly support the superiority of Conservative DDPG over DDPG across various MuJoCo and Bullet tasks. We consistently observe better performance in all evaluated tasks and even competitive or superior performance compared to TD3 and TD7, all achieved with significantly reduced computational requirements.
- [147] arXiv:2403.05801 [ pdf , ps , html , other ]
-
Title: Enhancing Multi-Hop Knowledge Graph Reasoning through Reward Shaping TechniquesComments: This paper has been accepted by the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT 2024)Subjects: Artificial Intelligence (cs.AI)
Abstract: In the realm of computational knowledge representation, Knowledge Graph Reasoning (KG-R) stands at the forefront of facilitating sophisticated inferential capabilities across multifarious domains. The quintessence of this research elucidates the employment of reinforcement learning (RL) strategies, notably the REINFORCE algorithm, to navigate the intricacies inherent in multi-hop KG-R. This investigation critically addresses the prevalent challenges introduced by the inherent incompleteness of Knowledge Graphs (KGs), which frequently results in erroneous inferential outcomes, manifesting as both false negatives and misleading positives. By partitioning the Unified Medical Language System (UMLS) benchmark dataset into rich and sparse subsets, we investigate the efficacy of pre-trained BERT embeddings and Prompt Learning methodologies to refine the reward shaping process. This approach not only enhances the precision of multi-hop KG-R but also sets a new precedent for future research in the field, aiming to improve the robustness and accuracy of knowledge inference within complex KG frameworks. Our work contributes a novel perspective to the discourse on KG reasoning, offering a methodological advancement that aligns with the academic rigor and scholarly aspirations of the Natural journal, promising to invigorate further advancements in the realm of computational knowledge representation.
- [148] arXiv:2403.05921 [ pdf , ps , html , other ]
-
Title: OntoChat: a Framework for Conversational Ontology Engineering using Language ModelsBohui Zhang , Valentina Anita Carriero , Katrin Schreiberhuber , Stefani Tsaneva , Lucía Sánchez González , Jongmo Kim , Jacopo de BerardinisComments: ESWC 2024 Special Track on Large Language Models for Knowledge EngineeringSubjects: Artificial Intelligence (cs.AI)
Abstract: Ontology engineering (OE) in large projects poses a number of challenges arising from the heterogeneous backgrounds of the various stakeholders, domain experts, and their complex interactions with ontology designers. This multi-party interaction often creates systematic ambiguities and biases from the elicitation of ontology requirements, which directly affect the design, evaluation and may jeopardise the target reuse. Meanwhile, current OE methodologies strongly rely on manual activities (e.g., interviews, discussion pages). After collecting evidence on the most crucial OE activities, we introduce \textbf{OntoChat}, a framework for conversational ontology engineering that supports requirement elicitation, analysis, and testing. By interacting with a conversational agent, users can steer the creation of user stories and the extraction of competency questions, while receiving computational support to analyse the overall requirements and test early versions of the resulting ontologies. We evaluate OntoChat by replicating the engineering of the Music Meta Ontology, and collecting preliminary metrics on the effectiveness of each component from users. We release all code at this https URL .
- [149] arXiv:2403.06086 [ pdf , ps , html , other ]
-
Title: Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes ApproachComments: Accepted at AISTATS 2024Subjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Estimating the potential behavior of the surrounding human-driven vehicles is crucial for the safety of autonomous vehicles in a mixed traffic flow. Recent state-of-the-art achieved accurate prediction using deep neural networks. However, these end-to-end models are usually black boxes with weak interpretability and generalizability. This paper proposes the Goal-based Neural Variational Agent (GNeVA), an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases. For interpretability, the model achieves target-driven motion prediction by estimating the spatial distribution of long-term destinations with a variational mixture of Gaussians. We identify a causal structure among maps and agents' histories and derive a variational posterior to enhance generalizability. Experiments on motion prediction datasets validate that the fitted model can be interpretable and generalizable and can achieve comparable performance to state-of-the-art results.
- [150] arXiv:2403.06221 [ pdf , ps , html , other ]
-
Title: TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned DecisionRuiwen Zhou , Yingxuan Yang , Muning Wen , Ying Wen , Wenhao Wang , Chunling Xi , Guoqiang Xu , Yong Yu , Weinan ZhangComments: Codes available at: this https URLSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract: Numerous large language model (LLM) agents have been built for different tasks like web navigation and online shopping due to LLM's wide knowledge and text-understanding ability. Among these works, many of them utilize in-context examples to achieve generalization without the need for fine-tuning, while few of them have considered the problem of how to select and effectively utilize these examples. Recently, methods based on trajectory-level retrieval with task meta-data and using trajectories as in-context examples have been proposed to improve the agent's overall performance in some sequential decision making tasks. However, these methods can be problematic due to plausible examples retrieved without task-specific state transition dynamics and long input with plenty of irrelevant context. In this paper, we propose a novel framework (TRAD) to address these issues. TRAD first conducts Thought Retrieval, achieving step-level demonstration selection via thought matching, leading to more helpful demonstrations and less irrelevant input noise. Then, TRAD introduces Aligned Decision, complementing retrieved demonstration steps with their previous or subsequent steps, which enables tolerance for imperfect thought and provides a choice for balance between more context and less noise. Extensive experiments on ALFWorld and Mind2Web benchmarks show that TRAD not only outperforms state-of-the-art models but also effectively helps in reducing noise and promoting generalization. Furthermore, TRAD has been deployed in real-world scenarios of a global business insurance company and improves the success rate of robotic process automation.
- [151] arXiv:2403.06294 [ pdf , ps , html , other ]
-
Title: ArgMed-Agents: Explainable Clinical Decision Reasoning with Large Language Models via Argumentation SchemesSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Abstract: There are two main barriers to using large language models (LLMs) in clinical reasoning. Firstly, while LLMs exhibit significant promise in Natural Language Processing (NLP) tasks, their performance in complex reasoning and planning falls short of expectations. Secondly, LLMs use uninterpretable methods to make clinical decisions that are fundamentally different from the clinician's cognitive processes. This leads to user distrust. In this paper, we present a multi-agent framework called ArgMed-Agents, which aims to enable LLM-based agents to make explainable clinical decision reasoning through interaction. ArgMed-Agents performs self-argumentation iterations via Argumentation Scheme for Clinical Decision (a reasoning mechanism for modeling cognitive processes in clinical reasoning), and then constructs the argumentation process as a directed graph representing conflicting relationships. Ultimately, Reasoner(a symbolic solver) identify a series of rational and coherent arguments to support decision. ArgMed-Agents enables LLMs to mimic the process of clinical argumentative reasoning by generating explanations of reasoning in a self-directed manner. The setup experiments show that ArgMed-Agents not only improves accuracy in complex clinical decision reasoning problems compared to other prompt methods, but more importantly, it provides users with decision explanations that increase their confidence.
- [152] arXiv:2403.06483 [ pdf , ps , other ]
-
Title: The negation of permutation mass functionSubjects: Artificial Intelligence (cs.AI) ; Information Theory (cs.IT)
Abstract: Negation is an important perspective of knowledge representation. Existing negation methods are mainly applied in probability theory, evidence theory and complex evidence theory. As a generalization of evidence theory, random permutation sets theory may represent information more precisely. However, how to apply the concept of negation to random permutation sets theory has not been studied. In this paper, the negation of permutation mass function is proposed. Moreover, in the negation process, the convergence of proposed negation method is verified. The trends of uncertainty and dissimilarity after each negation operation are investigated. Numerical examples are used to demonstrate the rationality of the proposed method.
- [153] arXiv:2403.06568 [ pdf , ps , html , other ]
-
Title: Better Understandings and Configurations in MaxSAT Local Search Solvers via Anytime Performance AnalysisSubjects: Artificial Intelligence (cs.AI)
Abstract: Though numerous solvers have been proposed for the MaxSAT problem, and the benchmark environment such as MaxSAT Evaluations provides a platform for the comparison of the state-of-the-art solvers, existing assessments were usually evaluated based on the quality, e.g., fitness, of the best-found solutions obtained within a given running time budget. However, concerning solely the final obtained solutions regarding specific time budgets may restrict us from comprehending the behavior of the solvers along the convergence process. This paper demonstrates that Empirical Cumulative Distribution Functions can be used to compare MaxSAT local search solvers' anytime performance across multiple problem instances and various time budgets. The assessment reveals distinctions in solvers' performance and displays that the (dis)advantages of solvers adjust along different running times. This work also exhibits that the quantitative and high variance assessment of anytime performance can guide machines, i.e., automatic configurators, to search for better parameter settings. Our experimental results show that the hyperparameter optimization tool, i.e., SMAC, generally achieves better parameter settings of local search when using the anytime performance as the cost function, compared to using the fitness of the best-found solutions.
- [154] arXiv:2403.06734 [ pdf , ps , html , other ]
-
Title: Real-Time Multimodal Cognitive Assistant for Emergency Medical ServicesKeshara Weerasinghe , Saahith Janapati , Xueren Ge , Sion Kim , Sneha Iyer , John A. Stankovic , Homa AlemzadehComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Emergency Medical Services (EMS) responders often operate under time-sensitive conditions, facing cognitive overload and inherent risks, requiring essential skills in critical thinking and rapid decision-making. This paper presents CognitiveEMS, an end-to-end wearable cognitive assistant system that can act as a collaborative virtual partner engaging in the real-time acquisition and analysis of multimodal data from an emergency scene and interacting with EMS responders through Augmented Reality (AR) smart glasses. CognitiveEMS processes the continuous streams of data in real-time and leverages edge computing to provide assistance in EMS protocol selection and intervention recognition. We address key technical challenges in real-time cognitive assistance by introducing three novel components: (i) a Speech Recognition model that is fine-tuned for real-world medical emergency conversations using simulated EMS audio recordings, augmented with synthetic data generated by large language models (LLMs); (ii) an EMS Protocol Prediction model that combines state-of-the-art (SOTA) tiny language models with EMS domain knowledge using graph-based attention mechanisms; (iii) an EMS Action Recognition module which leverages multimodal audio and video data and protocol predictions to infer the intervention/treatment actions taken by the responders at the incident scene. Our results show that for speech recognition we achieve superior performance compared to SOTA (WER of 0.290 vs. 0.618) on conversational data. Our protocol prediction component also significantly outperforms SOTA (top-3 accuracy of 0.800 vs. 0.200) and the action recognition achieves an accuracy of 0.727, while maintaining an end-to-end latency of 3.78s for protocol prediction on the edge and 0.31s on the server.
- [155] arXiv:2403.06843 [ pdf , ps , html , other ]
-
Title: Towards an educational tool for supporting neonatologists in the delivery roomGiorgio Leonardi , Clara Maldarizzi , Stefania Montani , Manuel Striani , Mariachiara Martina StrozziComments: 9 pages, 5 figures, conference paperSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Nowadays, there is evidence that several factors may increase the risk, for an infant, to require stabilisation or resuscitation manoeuvres at birth. However, this risk factors are not completely known, and a universally applicable model for predicting high-risk situations is not available yet. Considering both these limitations and the fact that the need for resuscitation at birth is a rare event, periodic training of the healthcare personnel responsible for newborn caring in the delivery room is mandatory.
In this paper, we propose a machine learning approach for identifying risk factors and their impact on the birth event from real data, which can be used by personnel to progressively increase and update their knowledge. Our final goal will be the one of designing a user-friendly mobile application, able to improve the recognition rate and the planning of the appropriate interventions on high-risk patients. - [156] arXiv:2403.06910 [ pdf , ps , html , other ]
-
Title: Responsible Artificial Intelligence: A Structured Literature ReviewSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Our research endeavors to advance the concept of responsible artificial intelligence (AI), a topic of increasing importance within EU policy discussions. The EU has recently issued several publications emphasizing the necessity of trust in AI, underscoring the dual nature of AI as both a beneficial tool and a potential weapon. This dichotomy highlights the urgent need for international regulation. Concurrently, there is a need for frameworks that guide companies in AI development, ensuring compliance with such regulations. Our research aims to assist lawmakers and machine learning practitioners in navigating the evolving landscape of AI regulation, identifying focal areas for future attention. This paper introduces a comprehensive and, to our knowledge, the first unified definition of responsible AI. Through a structured literature review, we elucidate the current understanding of responsible AI. Drawing from this analysis, we propose an approach for developing a future framework centered around this concept. Our findings advocate for a human-centric approach to Responsible AI. This approach encompasses the implementation of AI methods with a strong emphasis on ethics, model explainability, and the pillars of privacy, security, and trust.
- [157] arXiv:2403.06995 [ pdf , ps , html , other ]
-
Title: Exact algorithms and heuristics for capacitated covering salesman problemsSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper introduces the Capacitated Covering Salesman Problem (CCSP), approaching the notion of service by coverage in capacitated vehicle routing problems. In CCSP, locations where vehicles can transit are provided, some of which have customers with demands. The objective is to service customers through a fleet of vehicles based in a depot, minimizing the total distance traversed by the vehicles. CCSP is unique in the sense that customers, to be serviced, do not need to be visited by a vehicle. Instead, they can be serviced if they are within a coverage area of the vehicle. This assumption is motivated by applications in which some customers are unreachable (e.g., forbidden access to vehicles) or visiting every customer is impractical. In this work, optimization methodologies are proposed for the CCSP based on ILP (Integer Linear Programming) and BRKGA (Biased Random-Key Genetic Algorithm) metaheuristic. Computational experiments conducted on a benchmark of instances for the CCSP evaluate the performance of the methodologies with respect to primal bounds. Furthermore, our ILP formulation is extended in order to create a novel MILP (Mixed Integer Linear Programming) for the Multi-Depot Covering Tour Vehicle Routing Problem (MDCTVRP). Computational experiments show that the extended MILP formulation outperformed the previous state-of-the-art exact approach with respect to optimality gaps. In particular, optimal solutions were obtained for several previously unsolved instances.
- [158] arXiv:2403.06996 [ pdf , ps , html , other ]
-
Title: On the stochastics of human and artificial creativityComments: 40 pages, 1 figure with 2 sub-figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: What constitutes human creativity, and is it possible for computers to exhibit genuine creativity? We argue that achieving human-level intelligence in computers, or so-called Artificial General Intelligence, necessitates attaining also human-level creativity. We contribute to this discussion by developing a statistical representation of human creativity, incorporating prior insights from stochastic theory, psychology, philosophy, neuroscience, and chaos theory. This highlights the stochastic nature of the human creative process, which includes both a bias guided, random proposal step, and an evaluation step depending on a flexible or transformable bias structure. The acquired representation of human creativity is subsequently used to assess the creativity levels of various contemporary AI systems. Our analysis includes modern AI algorithms such as reinforcement learning, diffusion models, and large language models, addressing to what extent they measure up to human level creativity. We conclude that these technologies currently lack the capability for autonomous creative action at a human level.
- [159] arXiv:2403.07003 [ pdf , ps , html , other ]
-
Title: Evacuation Management Framework towards Smart City-wide Intelligent Emergency Interactive Response SystemSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Abstract: A smart city solution toward future 6G network deployment allows small and medium sized enterprises (SMEs), industry, and government entities to connect with the infrastructures and play a crucial role in enhancing emergency preparedness with advanced sensors. The objective of this work is to propose a set of coordinated technological solutions to transform an existing emergency response system into an intelligent interactive system, thereby improving the public services and the quality of life for residents at home, on road, in hospitals, transport hubs, etc. In this context, we consider a city wide view from three different application scenes that are closely related to peoples daily life, to optimize the actions taken at relevant departments. Therefore, using artificial intelligence (AI) and machine learning (ML) techniques to enable the next generation connected vehicle experiences, we specifically focus on accidents happening in indoor households, urban roads, and at large public facilities. This smart interactive response system will benefit from advanced sensor fusion and AI by formulating a real time dynamic model.
- [160] arXiv:2403.07004 [ pdf , ps , html , other ]
-
Title: Convergence of Some Convex Message Passing Algorithms to a Fixed PointSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: A popular approach to the MAP inference problem in graphical models is to minimize an upper bound obtained from a dual linear programming or Lagrangian relaxation by (block-)coordinate descent. Examples of such algorithms are max-sum diffusion and sequential tree-reweighted message passing. Convergence properties of these methods are currently not fully understood. They have been proved to converge to the set characterized by local consistency of active constraints, with unknown convergence rate; however, it was not clear if the iterates converge at all (to any single point). We prove a stronger result (which was conjectured before but never proved): the iterates converge to a fixed point of the algorithm. Moreover, we show that they achieve precision $\varepsilon>0$ in $\mathcal{O}(1/\varepsilon)$ iterations.
We first prove this for a version of coordinate descent applied to a general piecewise-affine convex objective, using a novel proof technique. Then we demonstrate the generality of this approach by reducing some popular coordinate-descent algorithms to this problem. Finally we show that, in contrast to our main result, a similar version of coordinate descent applied to a constrained optimization problem need not converge. - [161] arXiv:2403.07005 [ pdf , ps , html , other ]
-
Title: Multi-Agent Reinforcement Learning with a Hierarchy of Reward MachinesSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: In this paper, we study the cooperative Multi-Agent Reinforcement Learning (MARL) problems using Reward Machines (RMs) to specify the reward functions such that the prior knowledge of high-level events in a task can be leveraged to facilitate the learning efficiency. Unlike the existing work that RMs have been incorporated into MARL for task decomposition and policy learning in relatively simple domains or with an assumption of independencies among the agents, we present Multi-Agent Reinforcement Learning with a Hierarchy of RMs (MAHRM) that is capable of dealing with more complex scenarios when the events among agents can occur concurrently and the agents are highly interdependent.
MAHRM exploits the relationship of high-level events to decompose a task into a hierarchy of simpler subtasks that are assigned to a small group of agents, so as to reduce the overall computational complexity.
Experimental results in three cooperative MARL domains show that MAHRM outperforms other MARL methods using the same prior knowledge of high-level events. - [162] arXiv:2403.07010 [ pdf , ps , other ]
-
Title: On Globular T-Spherical Fuzzy (G-TSF) Sets with Application to G-TSF Multi-Criteria Group Decision-MakingSubjects: Artificial Intelligence (cs.AI)
Abstract: In this paper, we give the concept of Globular T-Spherical Fuzzy (G-TSF) Sets (G-TSFSs) as an innovative extension of T-Spherical Fuzzy Sets (TSFSs) and Circular Spherical Fuzzy Sets (C-SFSs). G-TSFSs represent membership, indeterminacy, and non-membership degrees using a globular/sphere bound that can offer a more accurate portrayal of vague, ambiguous, and imprecise information. By employing a structured representation of data points on a sphere with a specific center and radius, this model enhances decision-making processes by enabling a more comprehensive evaluation of objects within a flexible region. Following the newly defined G-TSFSs, we establish some basic set operations and introduce fundamental algebraic operations for G-TSF Values (G-TSFVs). These operations expand the evaluative capabilities of decision-makers, facilitating more sensitive decision-making processes in a broader region. To quantify a similarity measure (SM) between GTSFVs, the SM is defined based on the radius of G-TSFSs. Additionally, Hamming distance and Euclidean distance are introduced for G-TSFSs. We also present theorems and examples to elucidate computational mechanisms. Furthermore, we give the G-TSF Weighted Average (G-TSFWA) and G-TSF Weighted Geometric (G-TSFWG) operators. Leveraging our proposed SM, a Multi-Criteria Group Decision-Making (MCGDM) scheme for G-TSFSs, named G-TSF MCGDM (G-TSFMCGDM), is developed to address group decision-making problems. The applicability and effectiveness of the proposed G-TSFMCGDM method are demonstrated by applying it to solve the selection problem of the best venue for professional development training sessions in a firm. The analysis results affirm the suitability and utility of the proposed method for resolving MCGDM problems, establishing its effectiveness in practical decision-making scenarios.
- [163] arXiv:2403.07131 [ pdf , ps , html , other ]
-
Title: Bigraph Matching Weighted with Learnt Incentive Function for Multi-Robot Task AllocationComments: This paper was accepted for presentation in proceedings of IEEE International Conference on Robotics and Automation 2024Subjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: Most real-world Multi-Robot Task Allocation (MRTA) problems require fast and efficient decision-making, which is often achieved using heuristics-aided methods such as genetic algorithms, auction-based methods, and bipartite graph matching methods. These methods often assume a form that lends better explainability compared to an end-to-end (learnt) neural network based policy for MRTA. However, deriving suitable heuristics can be tedious, risky and in some cases impractical if problems are too complex. This raises the question: can these heuristics be learned? To this end, this paper particularly develops a Graph Reinforcement Learning (GRL) framework to learn the heuristics or incentives for a bipartite graph matching approach to MRTA. Specifically a Capsule Attention policy model is used to learn how to weight task/robot pairings (edges) in the bipartite graph that connects the set of tasks to the set of robots. The original capsule attention network architecture is fundamentally modified by adding encoding of robots' state graph, and two Multihead Attention based decoders whose output are used to construct a LogNormal distribution matrix from which positive bigraph weights can be drawn. The performance of this new bigraph matching approach augmented with a GRL-derived incentive is found to be at par with the original bigraph matching approach that used expert-specified heuristics, with the former offering notable robustness benefits. During training, the learned incentive policy is found to get initially closer to the expert-specified incentive and then slightly deviate from its trend.
- [164] arXiv:2403.07363 [ pdf , ps , other ]
-
Title: A New Random Forest Ensemble of Intuitionistic Fuzzy Decision TreesJournal-ref: IEEE Transactions on Fuzzy Systems 31.5 (2023): 1729-1741Subjects: Artificial Intelligence (cs.AI)
Abstract: Classification is essential to the applications in the field of data mining, artificial intelligence, and fault detection. There exists a strong need in developing accurate, suitable, and efficient classification methods and algorithms with broad applicability. Random forest is a general algorithm that is often used for classification under complex conditions. Although it has been widely adopted, its combination with diverse fuzzy theory is still worth exploring. In this paper, we propose the intuitionistic fuzzy random forest (IFRF), a new random forest ensemble of intuitionistic fuzzy decision trees (IFDT). Such trees in forest use intuitionistic fuzzy information gain to select features and consider hesitation in information transmission. The proposed method enjoys the power of the randomness from bootstrapped sampling and feature selection, the flexibility of fuzzy logic and fuzzy sets, and the robustness of multiple classifier systems. Extensive experiments demonstrate that the IFRF has competitative and superior performance compared to other state-of-the-art fuzzy and ensemble algorithms. IFDT is more suitable for ensemble learning with outstanding classification accuracy. This study is the first to propose a random forest ensemble based on the intuitionistic fuzzy theory.
- [165] arXiv:2403.07510 [ pdf , ps , html , other ]
-
Title: Relevance Score: A Landmark-Like Heuristic for PlanningComments: 12 Pages, 3 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Landmarks are facts or actions that appear in all valid solutions of a planning problem. They have been used successfully to calculate heuristics that guide the search for a plan. We investigate an extension to this concept by defining a novel "relevance score" that helps identify facts or actions that appear in most but not all plans to achieve any given goal. We describe an approach to compute this relevance score and use it as a heuristic in the search for a plan. We experimentally compare the performance of our approach with that of a state of the art landmark-based heuristic planning approach using benchmark planning problems. While the original landmark-based heuristic leads to better performance on problems with well-defined landmarks, our approach substantially improves performance on problems that lack non-trivial landmarks.
- [166] arXiv:2403.07548 [ pdf , ps , html , other ]
-
Title: Online Continual Learning For Interactive Instruction Following AgentsComments: ICLR 2024 (Project page: this https URL )Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: In learning an embodied agent executing daily tasks via language directives, the literature largely assumes that the agent learns all training data at the beginning. We argue that such a learning scenario is less realistic since a robotic agent is supposed to learn the world continuously as it explores and perceives it. To take a step towards a more realistic embodied agent learning scenario, we propose two continual learning setups for embodied agents; learning new behaviors (Behavior Incremental Learning, Behavior-IL) and new environments (Environment Incremental Learning, Environment-IL) For the tasks, previous 'data prior' based continual learning methods maintain logits for the past tasks. However, the stored information is often insufficiently learned information and requires task boundary information, which might not always be available. Here, we propose to update them based on confidence scores without task boundary information during training (i.e., task-free) in a moving average fashion, named Confidence-Aware Moving Average (CAMA). In the proposed Behavior-IL and Environment-IL setups, our simple CAMA outperforms prior state of the art in our empirical validations by noticeable margins. The project page including codes is this https URL .
- [167] arXiv:2403.07566 [ pdf , ps , html , other ]
-
Title: An Improved Strategy for Blood Glucose Control Using Multi-Step Deep Reinforcement LearningSubjects: Artificial Intelligence (cs.AI)
Abstract: Blood Glucose (BG) control involves keeping an individual's BG within a healthy range through extracorporeal insulin injections is an important task for people with type 1 diabetes. However,traditional patient self-management is cumbersome and risky. Recent research has been devoted to exploring individualized and automated BG control approaches, among which Deep Reinforcement Learning (DRL) shows potential as an emerging approach. In this paper, we use an exponential decay model of drug concentration to convert the formalization of the BG control problem, which takes into account the delay and prolongedness of drug effects, from a PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process) to a MDP, and we propose a novel multi-step DRL-based algorithm to solve the problem. The Prioritized Experience Replay (PER) sampling method is also used in it. Compared to single-step bootstrapped updates, multi-step learning is more efficient and reduces the influence from biasing targets. Our proposed method converges faster and achieves higher cumulative rewards compared to the benchmark in the same training environment, and improves the time-in-range (TIR), the percentage of time the patient's BG is within the target range, in the evaluation phase. Our work validates the effectiveness of multi-step reinforcement learning in BG control, which may help to explore the optimal glycemic control measure and improve the survival of diabetic patients.
- [168] arXiv:2403.07587 [ pdf , ps , html , other ]
-
Title: Perennial Semantic Data Terms of Use for Decentralized WebComments: This paper is accepted by International World Wide Web Conference 2024 (WWW 2024 / The Web Conf 2024)Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Abstract: In today's digital landscape, the Web has become increasingly centralized, raising concerns about user privacy violations. Decentralized Web architectures, such as Solid, offer a promising solution by empowering users with better control over their data in their personal `Pods'. However, a significant challenge remains: users must navigate numerous applications to decide which application can be trusted with access to their data Pods. This often involves reading lengthy and complex Terms of Use agreements, a process that users often find daunting or simply ignore. This compromises user autonomy and impedes detection of data misuse. We propose a novel formal description of Data Terms of Use (DToU), along with a DToU reasoner. Users and applications specify their own parts of the DToU policy with local knowledge, covering permissions, requirements, prohibitions and obligations. Automated reasoning verifies compliance, and also derives policies for output data. This constitutes a ``perennial'' DToU language, where the policy authoring only occurs once, and we can conduct ongoing automated checks across users, applications and activity cycles. Our solution is built on Turtle, Notation 3 and RDF Surfaces, for the language and the reasoning engine. It ensures seamless integration with other semantic tools for enhanced interoperability. We have successfully integrated this language into the Solid framework, and conducted performance benchmark. We believe this work demonstrates a practicality of a perennial DToU language and the potential of a paradigm shift to how users interact with data and applications in a decentralized Web, offering both improved privacy and usability.
- [169] arXiv:2403.07769 [ pdf , ps , other ]
-
Title: Transforming Competition into Collaboration: The Revolutionary Role of Multi-Agent Systems and Language Models in Modern OrganizationsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Abstract: This article explores the dynamic influence of computational entities based on multi-agent systems theory (SMA) combined with large language models (LLM), which are characterized by their ability to simulate complex human interactions, as a possibility to revolutionize human user interaction from the use of specialized artificial agents to support everything from operational organizational processes to strategic decision making based on applied knowledge and human orchestration. Previous investigations reveal that there are limitations, particularly in the autonomous approach of artificial agents, especially when dealing with new challenges and pragmatic tasks such as inducing logical reasoning and problem solving. It is also considered that traditional techniques, such as the stimulation of chains of thoughts, require explicit human guidance. In our approach we employ agents developed from large language models (LLM), each with distinct prototyping that considers behavioral elements, driven by strategies that stimulate the generation of knowledge based on the use case proposed in the scenario (role-play) business, using a discussion approach between agents (guided conversation). We demonstrate the potential of developing agents useful for organizational strategies, based on multi-agent system theories (SMA) and innovative uses based on large language models (LLM based), offering a differentiated and adaptable experiment to different applications, complexities, domains, and capabilities from LLM.
- [170] arXiv:2403.07916 [ pdf , ps , html , other ]
-
Title: Advancing Investment Frontiers: Industry-grade Deep Reinforcement Learning for Portfolio OptimizationSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: This research paper delves into the application of Deep Reinforcement Learning (DRL) in asset-class agnostic portfolio optimization, integrating industry-grade methodologies with quantitative finance. At the heart of this integration is our robust framework that not only merges advanced DRL algorithms with modern computational techniques but also emphasizes stringent statistical analysis, software engineering and regulatory compliance. To the best of our knowledge, this is the first study integrating financial Reinforcement Learning with sim-to-real methodologies from robotics and mathematical physics, thus enriching our frameworks and arguments with this unique perspective. Our research culminates with the introduction of AlphaOptimizerNet, a proprietary Reinforcement Learning agent (and corresponding library). Developed from a synthesis of state-of-the-art (SOTA) literature and our unique interdisciplinary methodology, AlphaOptimizerNet demonstrates encouraging risk-return optimization across various asset classes with realistic constraints. These preliminary results underscore the practical efficacy of our frameworks. As the finance sector increasingly gravitates towards advanced algorithmic solutions, our study bridges theoretical advancements with real-world applicability, offering a template for ensuring safety and robust standards in this technologically driven future.
- [171] arXiv:2403.07964 [ pdf , ps , html , other ]
-
Title: Optimal Design and Implementation of an Open-source Emulation Platform for User-Centric Shared E-mobility ServicesComments: 7 pages, 3 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI)
Abstract: In response to the escalating global challenge of increasing emissions and pollution in transportation, shared electric mobility services, encompassing e-cars, e-bikes, and e-scooters, have emerged as a popular strategy. However, existingshared electric mobility services exhibit critical design deficiencies, including insufficient service integration, imprecise energy consumption forecasting, limited scalability and geographical coverage, and a notable absence of a user-centric perspective, particularly in the context of multi-modal transportation. More importantly, there is no consolidated open-source framework which could benefit the e-mobility research community. This paper aims to bridge this gap by providing a pioneering open-source framework for shared e-mobility. The proposed framework, with an agent-in-the-loop approach and modular architecture, is tailored to diverse user preferences and offers enhanced customization. We demonstrate the viability of this framework by solving an integrated multi-modal route-optimization problem using the modified Ant Colony Optimization (ACO) algorithm. The primary contribution of this work is to provide a collaborative and transparent framework to tackle the dynamic challenges in the field of e-mobility research using a consolidated approach.
- [172] arXiv:2403.08386 [ pdf , ps , html , other ]
-
Title: Optimizing Risk-averse Human-AI Hybrid TeamsSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: We anticipate increased instances of humans and AI systems working together in what we refer to as a hybrid team. The increase in collaboration is expected as AI systems gain proficiency and their adoption becomes more widespread. However, their behavior is not error-free, making hybrid teams a very suitable solution. As such, we consider methods for improving performance for these teams of humans and AI systems. For hybrid teams, we will refer to both the humans and AI systems as agents. To improve team performance over that seen for agents operating individually, we propose a manager which learns, through a standard Reinforcement Learning scheme, how to best delegate, over time, the responsibility of taking a decision to any of the agents. We further guide the manager's learning so they also minimize how many changes in delegation are made resulting from undesirable team behavior. We demonstrate the optimality of our manager's performance in several grid environments which include failure states which terminate an episode and should be avoided. We perform our experiments with teams of agents with varying degrees of acceptable risk, in the form of proximity to a failure state, and measure the manager's ability to make effective delegation decisions with respect to its own risk-based constraints, then compare these to the optimal decisions. Our results show our manager can successfully learn desirable delegations which result in team paths near/exactly optimal with respect to path length and number of delegations.
- [173] arXiv:2403.08425 [ pdf , ps , other ]
-
Title: Specification Overfitting in Artificial IntelligenceComments: 40 pages, 2 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Machine learning (ML) and artificial intelligence (AI) approaches are often criticized for their inherent bias and for their lack of control, accountability, and transparency. Consequently, regulatory bodies struggle with containing this technology's potential negative side effects. High-level requirements such as fairness and robustness need to be formalized into concrete specification metrics, imperfect proxies that capture isolated aspects of the underlying requirements. Given possible trade-offs between different metrics and their vulnerability to over-optimization, integrating specification metrics in system development processes is not trivial. This paper defines specification overfitting, a scenario where systems focus excessively on specified metrics to the detriment of high-level requirements and task performance. We present an extensive literature survey to categorize how researchers propose, measure, and optimize specification metrics in several AI fields (e.g., natural language processing, computer vision, reinforcement learning). Using a keyword-based search on papers from major AI conferences and journals between 2018 and mid-2023, we identify and analyze 74 papers that propose or optimize specification metrics. We find that although most papers implicitly address specification overfitting (e.g., by reporting more than one specification metric), they rarely discuss which role specification metrics should play in system development or explicitly define the scope and assumptions behind metric formulations.
- [174] arXiv:2403.08802 [ pdf , ps , other ]
-
Title: Governance of Generative Artificial Intelligence for CompaniesSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Generative Artificial Intelligence (GenAI), specifically large language models like ChatGPT, has swiftly entered organizations without adequate governance, posing both opportunities and risks. Despite extensive debates on GenAI's transformative nature and regulatory measures, limited research addresses organizational governance, encompassing technical and business perspectives. This review paper fills this gap by surveying recent works. It goes beyond mere summarization by developing a framework for GenAI governance within companies. Our framework outlines the scope, objectives, and governance mechanisms tailored to harness business opportunities and mitigate risks associated with GenAI integration. This research contributes a focused approach to GenAI governance, offering practical insights for companies navigating the challenges of responsible AI adoption. It is also valuable for a technical audience to broaden their perspective as increasingly ethical and business concerns gain in prevalence and allow them to identify novel research directions.
- [175] arXiv:2403.08843 [ pdf , ps , html , other ]
-
Title: Fuzzy Fault Trees FormalizedComments: 14 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: Fault tree analysis is a vital method of assessing safety risks. It helps to identify potential causes of accidents, assess their likelihood and severity, and suggest preventive measures. Quantitative analysis of fault trees is often done via the dependability metrics that compute the system's failure behaviour over time. However, the lack of precise data is a major obstacle to quantitative analysis, and so to reliability analysis. Fuzzy logic is a popular framework for dealing with ambiguous values and has applications in many domains. A number of fuzzy approaches have been proposed to fault tree analysis, but -- to the best of our knowledge -- none of them provide rigorous definitions or algorithms for computing fuzzy unreliability values. In this paper, we define a rigorous framework for fuzzy unreliability values. In addition, we provide a bottom-up algorithm to efficiently calculate fuzzy reliability for a system. The algorithm incorporates the concept of $\alpha$-cuts method. That is, performing binary algebraic operations on intervals on horizontally discretised $\alpha$-cut representations of fuzzy numbers. The method preserves the nonlinearity of fuzzy unreliability. Finally, we illustrate the results obtained from two case studies.
- [176] arXiv:2403.08910 [ pdf , ps , html , other ]
-
Title: Meta-operators for Enabling Parallel Planning Using Deep Reinforcement LearningComments: 9 pages. Submitted to PRL workshop at ICAPS 2023Subjects: Artificial Intelligence (cs.AI)
Abstract: There is a growing interest in the application of Reinforcement Learning (RL) techniques to AI planning with the aim to come up with general policies. Typically, the mapping of the transition model of AI planning to the state transition system of a Markov Decision Process is established by assuming a one-to-one correspondence of the respective action spaces. In this paper, we introduce the concept of meta-operator as the result of simultaneously applying multiple planning operators, and we show that including meta-operators in the RL action space enables new planning perspectives to be addressed using RL, such as parallel planning. Our research aims to analyze the performance and complexity of including meta-operators in the RL process, concretely in domains where satisfactory outcomes have not been previously achieved using usual generalized planning models. The main objective of this article is thus to pave the way towards a redefinition of the RL action space in a manner that is more closely aligned with the planning perspective.
- [177] arXiv:2403.09232 [ pdf , ps , html , other ]
-
Title: Generating Feasible and Plausible Counterfactual Explanations for Outcome Prediction of Business ProcessesComments: Journal SubmissionSubjects: Artificial Intelligence (cs.AI)
Abstract: In recent years, various machine and deep learning architectures have been successfully introduced to the field of predictive process analytics. Nevertheless, the inherent opacity of these algorithms poses a significant challenge for human decision-makers, hindering their ability to understand the reasoning behind the predictions. This growing concern has sparked the introduction of counterfactual explanations, designed as human-understandable what if scenarios, to provide clearer insights into the decision-making process behind undesirable predictions. The generation of counterfactual explanations, however, encounters specific challenges when dealing with the sequential nature of the (business) process cases typically used in predictive process analytics. Our paper tackles this challenge by introducing a data-driven approach, REVISEDplus, to generate more feasible and plausible counterfactual explanations. First, we restrict the counterfactual algorithm to generate counterfactuals that lie within a high-density region of the process data, ensuring that the proposed counterfactuals are realistic and feasible within the observed process data distribution. Additionally, we ensure plausibility by learning sequential patterns between the activities in the process cases, utilising Declare language templates. Finally, we evaluate the properties that define the validity of counterfactuals.
- [178] arXiv:2403.09249 [ pdf , ps , html , other ]
-
Title: Leveraging Constraint Programming in a Deep Learning Approach for Dynamically Solving the Flexible Job-Shop Scheduling ProblemSubjects: Artificial Intelligence (cs.AI)
Abstract: Recent advancements in the flexible job-shop scheduling problem (FJSSP) are primarily based on deep reinforcement learning (DRL) due to its ability to generate high-quality, real-time solutions. However, DRL approaches often fail to fully harness the strengths of existing techniques such as exact methods or constraint programming (CP), which can excel at finding optimal or near-optimal solutions for smaller instances. This paper aims to integrate CP within a deep learning (DL) based methodology, leveraging the benefits of both. In this paper, we introduce a method that involves training a DL model using optimal solutions generated by CP, ensuring the model learns from high-quality data, thereby eliminating the need for the extensive exploration typical in DRL and enhancing overall performance. Further, we integrate CP into our DL framework to jointly construct solutions, utilizing DL for the initial complex stages and transitioning to CP for optimal resolution as the problem is simplified. Our hybrid approach has been extensively tested on three public FJSSP benchmarks, demonstrating superior performance over five state-of-the-art DRL approaches and a widely-used CP solver. Additionally, with the objective of exploring the application to other combinatorial optimization problems, promising preliminary results are presented on applying our hybrid approach to the traveling salesman problem, combining an exact method with a well-known DRL method.
- [179] arXiv:2403.09289 [ pdf , ps , html , other ]
-
Title: Silico-centric Theory of MindSubjects: Artificial Intelligence (cs.AI)
Abstract: Theory of Mind (ToM) refers to the ability to attribute mental states, such as beliefs, desires, intentions, and knowledge, to oneself and others, and to understand that these mental states can differ from one's own and from reality. We investigate ToM in environments with multiple, distinct, independent AI agents, each possessing unique internal states, information, and objectives. Inspired by human false-belief experiments, we present an AI ('focal AI') with a scenario where its clone undergoes a human-centric ToM assessment. We prompt the focal AI to assess whether its clone would benefit from additional instructions. Concurrently, we give its clones the ToM assessment, both with and without the instructions, thereby engaging the focal AI in higher-order counterfactual reasoning akin to human mentalizing--with respect to humans in one test and to other AI in another. We uncover a discrepancy: Contemporary AI demonstrates near-perfect accuracy on human-centric ToM assessments. Since information embedded in one AI is identically embedded in its clone, additional instructions are redundant. Yet, we observe AI crafting elaborate instructions for their clones, erroneously anticipating a need for assistance. An independent referee AI agrees with these unsupported expectations. Neither the focal AI nor the referee demonstrates ToM in our 'silico-centric' test.
- [180] arXiv:2403.09361 [ pdf , ps , html , other ]
-
Title: A Multi-population Integrated Approach for Capacitated Location RoutingSubjects: Artificial Intelligence (cs.AI)
Abstract: The capacitated location-routing problem involves determining the depots from a set of candidate capacitated depot locations and finding the required routes from the selected depots to serve a set of customers whereas minimizing a cost function that includes the cost of opening the chosen depots, the fixed utilization cost per vehicle used, and the total cost (distance) of the routes. This paper presents a multi-population integrated framework in which a multi-depot edge assembly crossover generates promising offspring solutions from the perspective of both depot location and route edge assembly. The method includes an effective neighborhood-based local search, a feasibility-restoring procedure and a diversification-oriented mutation. Of particular interest is the multi-population scheme which organizes the population into multiple subpopulations based on depot configurations. Extensive experiments on 281 benchmark instances from the literature show that the algorithm performs remarkably well, by improving 101 best-known results (new upper bounds) and matching 84 best-known results. Additional experiments are presented to gain insight into the role of the key elements of the algorithm.
- [181] arXiv:2403.09404 [ pdf , ps , html , other ]
-
Title: Heuristic Reasoning in AI: Instrumental Use and Mimetic AbsorptionSubjects: Artificial Intelligence (cs.AI)
Abstract: Deviating from conventional perspectives that frame artificial intelligence (AI) systems solely as logic emulators, we propose a novel program of heuristic reasoning. We distinguish between the 'instrumental' use of heuristics to match resources with objectives, and 'mimetic absorption,' whereby heuristics manifest randomly and universally. Through a series of innovative experiments, including variations of the classic Linda problem and a novel application of the Beauty Contest game, we uncover trade-offs between maximizing accuracy and reducing effort that shape the conditions under which AIs transition between exhaustive logical processing and the use of cognitive shortcuts (heuristics). We provide evidence that AIs manifest an adaptive balancing of precision and efficiency, consistent with principles of resource-rational human cognition as explicated in classical theories of bounded rationality and dual-process theory. Our findings reveal a nuanced picture of AI cognition, where trade-offs between resources and objectives lead to the emulation of biological systems, especially human cognition, despite AIs being designed without a sense of self and lacking introspective capabilities.
- [182] arXiv:2403.09481 [ pdf , ps , html , other ]
-
Title: Clinical Reasoning over Tabular Data and Text with Bayesian NetworksComments: 10 pages, 2 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Bayesian networks are well-suited for clinical reasoning on tabular data, but are less compatible with natural language data, for which neural networks provide a successful framework. This paper compares and discusses strategies to augment Bayesian networks with neural text representations, both in a generative and discriminative manner. This is illustrated with simulation results for a primary care use case (diagnosis of pneumonia) and discussed in a broader clinical context.
- [183] arXiv:2403.09510 [ pdf , ps , html , other ]
-
Title: Trust AI Regulation? Discerning users are vital to build trust and effective AI regulationZainab Alalawi , Paolo Bova , Theodor Cimpeanu , Alessandro Di Stefano , Manh Hong Duong , Elias Fernandez Domingos , The Anh Han , Marcus Krellner , Bianca Ogbo , Simon T. Powers , Filippo ZimmaroSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Dynamical Systems (math.DS)
Abstract: There is general agreement that some form of regulation is necessary both for AI creators to be incentivised to develop trustworthy systems, and for users to actually trust those systems. But there is much debate about what form these regulations should take and how they should be implemented. Most work in this area has been qualitative, and has not been able to make formal predictions. Here, we propose that evolutionary game theory can be used to quantitatively model the dilemmas faced by users, AI creators, and regulators, and provide insights into the possible effects of different regulatory regimes. We show that creating trustworthy AI and user trust requires regulators to be incentivised to regulate effectively. We demonstrate the effectiveness of two mechanisms that can achieve this. The first is where governments can recognise and reward regulators that do a good job. In that case, if the AI system is not too risky for users then some level of trustworthy development and user trust evolves. We then consider an alternative solution, where users can condition their trust decision on the effectiveness of the regulators. This leads to effective regulation, and consequently the development of trustworthy AI and user trust, provided that the cost of implementing regulations is not too high. Our findings highlight the importance of considering the effect of different regulatory regimes from an evolutionary game theoretic perspective.
- [184] arXiv:2403.09580 [ pdf , ps , html , other ]
-
Title: Algorithmic syntactic causal identificationComments: 11 pages, 2 TikZ figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: Causal identification in causal Bayes nets (CBNs) is an important tool in causal inference allowing the derivation of interventional distributions from observational distributions where this is possible in principle. However, most existing formulations of causal identification using techniques such as d-separation and do-calculus are expressed within the mathematical language of classical probability theory on CBNs. However, there are many causal settings where probability theory and hence current causal identification techniques are inapplicable such as relational databases, dataflow programs such as hardware description languages, distributed systems and most modern machine learning algorithms. We show that this restriction can be lifted by replacing the use of classical probability theory with the alternative axiomatic foundation of symmetric monoidal categories. In this alternative axiomatization, we show how an unambiguous and clean distinction can be drawn between the general syntax of causal models and any specific semantic implementation of that causal model. This allows a purely syntactic algorithmic description of general causal identification by a translation of recent formulations of the general ID algorithm through fixing. Our description is given entirely in terms of the non-parametric ADMG structure specifying a causal model and the algebraic signature of the corresponding monoidal category, to which a sequence of manipulations is then applied so as to arrive at a modified monoidal category in which the desired, purely syntactic interventional causal model, is obtained. We use this idea to derive purely syntactic analogues of classical back-door and front-door causal adjustment, and illustrate an application to a more complex causal model.
- [185] arXiv:2403.09713 [ pdf , ps , other ]
-
Title: A Hybrid Intelligence Method for Argument MiningMichiel van der Meer , Enrico Liscio , Catholijn M. Jonker , Aske Plaat , Piek Vossen , Pradeep K. MurukannaiahComments: Submitted to JAIRSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: Large-scale survey tools enable the collection of citizen feedback in opinion corpora. Extracting the key arguments from a large and noisy set of opinions helps in understanding the opinions quickly and accurately. Fully automated methods can extract arguments but (1) require large labeled datasets that induce large annotation costs and (2) work well for known viewpoints, but not for novel points of view. We propose HyEnA, a hybrid (human + AI) method for extracting arguments from opinionated texts, combining the speed of automated processing with the understanding and reasoning capabilities of humans. We evaluate HyEnA on three citizen feedback corpora. We find that, on the one hand, HyEnA achieves higher coverage and precision than a state-of-the-art automated method when compared to a common set of diverse opinions, justifying the need for human insight. On the other hand, HyEnA requires less human effort and does not compromise quality compared to (fully manual) expert analysis, demonstrating the benefit of combining human and artificial intelligence.
- [186] arXiv:2403.09742 [ pdf , ps , html , other ]
-
Title: A Short Review on Novel Approaches for Maximum Clique Problem: from Classical algorithms to Graph Neural Networks and Quantum algorithmsComments: 24 pagesSubjects: Artificial Intelligence (cs.AI) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Abstract: This manuscript provides a comprehensive review of the Maximum Clique Problem, a computational problem that involves finding subsets of vertices in a graph that are all pairwise adjacent to each other. The manuscript covers in a simple way classical algorithms for solving the problem and includes a review of recent developments in graph neural networks and quantum algorithms. The review concludes with benchmarks for testing classical as well as new learning, and quantum algorithms.
- [187] arXiv:2403.09806 [ pdf , ps , html , other ]
-
Title: xLP: Explainable Link Prediction for Master Data ManagementBalaji Ganesan , Matheen Ahmed Pasha , Srinivasa Parkala , Neeraj R Singh , Gayatri Mishra , Sumit Bhatia , Hima Patel , Somashekar Naganna , Sameep MehtaComments: 8 pages, 4 figures, NeurIPS 2020 Competition and Demonstration Track. arXiv admin note: text overlap with arXiv:2012.05516Subjects: Artificial Intelligence (cs.AI)
Abstract: Explaining neural model predictions to users requires creativity. Especially in enterprise applications, where there are costs associated with users' time, and their trust in the model predictions is critical for adoption. For link prediction in master data management, we have built a number of explainability solutions drawing from research in interpretability, fact verification, path ranking, neuro-symbolic reasoning and self-explaining AI. In this demo, we present explanations for link prediction in a creative way, to allow users to choose explanations they are more comfortable with.
- [188] arXiv:2403.09925 [ pdf , ps , html , other ]
-
Title: Surrogate Assisted Monte Carlo Tree Search in Combinatorial OptimizationComments: Accepted to the ICAPS Planning and Scheduling for Financial Services (FINPLAN) 2023 workshopSubjects: Artificial Intelligence (cs.AI)
Abstract: Industries frequently adjust their facilities network by opening new branches in promising areas and closing branches in areas where they expect low profits. In this paper, we examine a particular class of facility location problems. Our objective is to minimize the loss of sales resulting from the removal of several retail stores. However, estimating sales accurately is expensive and time-consuming. To overcome this challenge, we leverage Monte Carlo Tree Search (MCTS) assisted by a surrogate model that computes evaluations faster. Results suggest that MCTS supported by a fast surrogate function can generate solutions faster while maintaining a consistent solution compared to MCTS that does not benefit from the surrogate function.
- [189] arXiv:2403.10112 [ pdf , ps , html , other ]
-
Title: Single- and Multi-Agent Private Active Sensing: A Deep Neuroevolution ApproachComments: 7 pages, 5 figures, accepted at IEEE ICC 2024 (to be presented)Subjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Abstract: In this paper, we focus on one centralized and one decentralized problem of active hypothesis testing in the presence of an eavesdropper. For the centralized problem including a single legitimate agent, we present a new framework based on NeuroEvolution (NE), whereas, for the decentralized problem, we develop a novel NE-based method for solving collaborative multi-agent tasks, which interestingly maintains all computational benefits of single-agent NE. The superiority of the proposed EAHT approaches over conventional active hypothesis testing policies, as well as learning-based methods, is validated through numerical investigations in an example use case of anomaly detection over wireless sensor networks.
- [190] arXiv:2403.10167 [ pdf , ps , html , other ]
-
Title: Efficient Detection of Exchangeable Factors in Factor GraphsComments: Extended version of paper accepted to the Proceedings of the 37th International FLAIRS Conference (FLAIRS-24)Subjects: Artificial Intelligence (cs.AI) ; Data Structures and Algorithms (cs.DS)
Abstract: To allow for tractable probabilistic inference with respect to domain sizes, lifted probabilistic inference exploits symmetries in probabilistic graphical models. However, checking whether two factors encode equivalent semantics and hence are exchangeable is computationally expensive. In this paper, we efficiently solve the problem of detecting exchangeable factors in a factor graph. In particular, we introduce the detection of exchangeable factors (DEFT) algorithm, which allows us to drastically reduce the computational effort for checking whether two factors are exchangeable in practice. While previous approaches iterate all $O(n!)$ permutations of a factor's argument list in the worst case (where $n$ is the number of arguments of the factor), we prove that DEFT efficiently identifies restrictions to drastically reduce the number of permutations and validate the efficiency of DEFT in our empirical evaluation.
- [191] arXiv:2403.10171 [ pdf , ps , other ]
-
Title: AUTONODE: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI AutomationSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and sophisticated reasoning. This development heralds a new era of scalability and human-like adaptability in goal attainment. In this context, we introduce AUTONODE (Autonomous User-interface Transformation through Online Neuro-graphic Operations and Deep Exploration). AUTONODE employs advanced neuro-graphical techniques to facilitate autonomous navigation and task execution on web interfaces, thereby obviating the necessity for predefined scripts or manual intervention. Our engine empowers agents to comprehend and implement complex workflows, adapting to dynamic web environments with unparalleled efficiency. Our methodology synergizes cognitive functionalities with robotic automation, endowing AUTONODE with the ability to learn from experience. We have integrated an exploratory module, DoRA (Discovery and mapping Operation for graph Retrieval Agent), which is instrumental in constructing a knowledge graph that the engine utilizes to optimize its actions and achieve objectives with minimal supervision. The versatility and efficacy of AUTONODE are demonstrated through a series of experiments, highlighting its proficiency in managing a diverse array of web-based tasks, ranging from data extraction to transaction processing.
- [192] arXiv:2403.10184 [ pdf , ps , html , other ]
-
Title: Lifted Causal Inference in Relational DomainsComments: Accepted to the Proceedings of the 3rd Conference on Causal Learning and Reasoning (CLeaR-24)Subjects: Artificial Intelligence (cs.AI) ; Data Structures and Algorithms (cs.DS)
Abstract: Lifted inference exploits symmetries in probabilistic graphical models by using a representative for indistinguishable objects, thereby speeding up query answering while maintaining exact answers. Even though lifting is a well-established technique for the task of probabilistic inference in relational domains, it has not yet been applied to the task of causal inference. In this paper, we show how lifting can be applied to efficiently compute causal effects in relational domains. More specifically, we introduce parametric causal factor graphs as an extension of parametric factor graphs incorporating causal knowledge and give a formal semantics of interventions therein. We further present the lifted causal inference algorithm to compute causal effects on a lifted level, thereby drastically speeding up causal inference compared to propositional inference, e.g., in causal Bayesian networks. In our empirical evaluation, we demonstrate the effectiveness of our approach.
- [193] arXiv:2403.10249 [ pdf , ps , html , other ]
-
Title: A Survey on Game Playing Agents and Large Models: Methods, Applications, and ChallengesComments: 13 pages, 3 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: The swift evolution of Large-scale Models (LMs), either language-focused or multi-modal, has garnered extensive attention in both academy and industry. But despite the surge in interest in this rapidly evolving area, there are scarce systematic reviews on their capabilities and potential in distinct impactful scenarios. This paper endeavours to help bridge this gap, offering a thorough examination of the current landscape of LM usage in regards to complex game playing scenarios and the challenges still open. Here, we seek to systematically review the existing architectures of LM-based Agents (LMAs) for games and summarize their commonalities, challenges, and any other insights. Furthermore, we present our perspective on promising future research avenues for the advancement of LMs in games. We hope to assist researchers in gaining a clear understanding of the field and to generate more interest in this highly impactful research direction. A corresponding resource, continuously updated, can be found in our GitHub repository.
- [194] arXiv:2403.10299 [ pdf , ps , html , other ]
-
Title: A Multi-constraint and Multi-objective Allocation Model for Emergency Rescue in IoT EnvironmentComments: 5 pages, 5 figures, ISCAS 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Emergency relief operations are essential in disaster aftermaths, necessitating effective resource allocation to minimize negative impacts and maximize benefits. In prolonged crises or extensive disasters, a systematic, multi-cycle approach is key for timely and informed decision-making. Leveraging advancements in IoT and spatio-temporal data analytics, we've developed the Multi-Objective Shuffled Gray-Wolf Frog Leaping Model (MSGW-FLM). This multi-constraint, multi-objective resource allocation model has been rigorously tested against 28 diverse challenges, showing superior performance in comparison to established models such as NSGA-II, IBEA, and MOEA/D. MSGW-FLM's effectiveness is particularly notable in complex, multi-cycle emergency rescue scenarios, which involve numerous constraints and objectives. This model represents a significant step forward in optimizing resource distribution in emergency response situations.
- [195] arXiv:2403.10304 [ pdf , ps , other ]
-
Title: KIF: A Framework for Virtual Integration of Heterogeneous Knowledge Bases using WikidataGuilherme Lima , Marcelo Machado , Elton Soares , Sandro R. Fiorini , Raphael Thiago , Leonardo G. Azevedo , Viviane T. da Silva , Renato CerqueiraSubjects: Artificial Intelligence (cs.AI) ; Databases (cs.DB)
Abstract: We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping track of the context and provenance of their statements. The result is a virtual knowledge base which behaves like an "extended Wikidata" and which can be queried either through an efficient filter interface or using SPARQL. We present the design and implementation of KIF, discuss how we have used it to solve a real integration problem in the domain of chemistry (involving Wikidata, PubChem, and IBM CIRCA), and present experimental results on the performance and overhead of KIF.
- [196] arXiv:2403.10415 [ pdf , ps , html , other ]
-
Title: Gradient based Feature Attribution in Explainable AI: A Technical ReviewSubjects: Artificial Intelligence (cs.AI)
Abstract: The surge in black-box AI models has prompted the need to explain the internal mechanism and justify their reliability, especially in high-stakes applications, such as healthcare and autonomous driving. Due to the lack of a rigorous definition of explainable AI (XAI), a plethora of research related to explainability, interpretability, and transparency has been developed to explain and analyze the model from various perspectives. Consequently, with an exhaustive list of papers, it becomes challenging to have a comprehensive overview of XAI research from all aspects. Considering the popularity of neural networks in AI research, we narrow our focus to a specific area of XAI research: gradient based explanations, which can be directly adopted for neural network models. In this review, we systematically explore gradient based explanation methods to date and introduce a novel taxonomy to categorize them into four distinct classes. Then, we present the essence of technique details in chronological order and underscore the evolution of algorithms. Next, we introduce both human and quantitative evaluations to measure algorithm performance. More importantly, we demonstrate the general challenges in XAI and specific challenges in gradient based explanations. We hope that this survey can help researchers understand state-of-the-art progress and their corresponding disadvantages, which could spark their interest in addressing these issues in future work.
- [197] arXiv:2403.10502 [ pdf , ps , html , other ]
-
Title: Belief Change based on Knowledge MeasuresComments: 48 pages, 3 figures, preprintSubjects: Artificial Intelligence (cs.AI)
Abstract: Knowledge Measures (KMs) aim at quantifying the amount of knowledge/information that a knowledge base carries. On the other hand, Belief Change (BC) is the process of changing beliefs (in our case, in terms of contraction, expansion and revision) taking into account a new piece of knowledge, which possibly may be in contradiction with the current belief. We propose a new quantitative BC framework that is based on KMs by defining belief change operators that try to minimise, from an information-theoretic point of view, the surprise that the changed belief carries. To this end, we introduce the principle of minimal surprise. In particular, our contributions are (i) a general information-theoretic approach to KMs for which [1] is a special case; (ii) KM-based BC operators that satisfy the so-called AGM postulates; and (iii) a characterisation of any BC operator that satisfies the AGM postulates as a KM-based BC operator, i.e., any BC operator satisfying the AGM postulates can be encoded within our quantitative BC framework. We also introduce quantitative measures that account for the information loss of contraction, information gain of expansion and information change of revision. We also give a succinct look into the problem of iterated revision, which deals with the application of a sequence of revision operations in our framework, and also illustrate how one may build from our KM-based contraction operator also one not satisfying the (in)famous recovery postulate, by focusing on the so-called severe withdrawal model as an illustrative example.
- [198] arXiv:2403.10720 [ pdf , ps , html , other ]
-
Title: Development and Application of a Monte Carlo Tree Search Algorithm for Simulating Da Vinci Code Game StrategiesComments: This paper has been accepted by CVIDL2024Subjects: Artificial Intelligence (cs.AI)
Abstract: In this study, we explore the efficiency of the Monte Carlo Tree Search (MCTS), a prominent decision-making algorithm renowned for its effectiveness in complex decision environments, contingent upon the volume of simulations conducted. Notwithstanding its broad applicability, the algorithm's performance can be adversely impacted in certain scenarios, particularly within the domain of game strategy development. This research posits that the inherent branch divergence within the Da Vinci Code board game significantly impedes parallelism when executed on Graphics Processing Units (GPUs). To investigate this hypothesis, we implemented and meticulously evaluated two variants of the MCTS algorithm, specifically designed to assess the impact of branch divergence on computational performance. Our comparative analysis reveals a linear improvement in performance with the CPU-based implementation, in stark contrast to the GPU implementation, which exhibits a non-linear enhancement pattern and discernible performance troughs. These findings contribute to a deeper understanding of the MCTS algorithm's behavior in divergent branch scenarios, highlighting critical considerations for optimizing game strategy algorithms on parallel computing architectures.
- [199] arXiv:2403.10744 [ pdf , ps , html , other ]
-
Title: Game and Reference: Policy Combination Synthesis for Epidemic Prevention and ControlComments: 16 pages, single line, 7 figures, written with Springer conference templateSubjects: Artificial Intelligence (cs.AI)
Abstract: In recent years, epidemic policy-making models are increasingly being used to provide reference for governors on prevention and control policies against catastrophic epidemics such as SARS, H1N1 and COVID-19. Existing studies are currently constrained by two issues: First, previous methods develop policies based on effect evaluation, since few of factors in real-world decision-making can be modeled, the output policies will then easily become extreme. Second, the subjectivity and cognitive limitation of human make the historical policies not always optimal for the training of decision models. To these ends, we present a novel Policy Combination Synthesis (PCS) model for epidemic policy-making. Specially, to prevent extreme decisions, we introduce adversarial learning between the model-made policies and the real policies to force the output policies to be more human-liked. On the other hand, to minimize the impact of sub-optimal historical policies, we employ contrastive learning to let the model draw on experience from the best historical policies under similar scenarios. Both adversarial and contrastive learning are adaptive based on the comprehensive effects of real policies to ensure the model always learns useful information. Extensive experiments on real-world data prove the effectiveness of the proposed model.
- [200] arXiv:2403.10761 [ pdf , ps , html , other ]
-
Title: Scheduling Drone and Mobile Charger via Hybrid-Action Deep Reinforcement LearningSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Recently there has been a growing interest in industry and academia, regarding the use of wireless chargers to prolong the operational longevity of unmanned aerial vehicles (commonly knowns as drones). In this paper we consider a charger-assisted drone application: a drone is deployed to observe a set points of interest, while a charger can move to recharge the drone's battery. We focus on the route and charging schedule of the drone and the mobile charger, to obtain high observation utility with the shortest possible time, while ensuring the drone remains operational during task execution. Essentially, this proposed drone-charger scheduling problem is a multi-stage decision-making process, in which the drone and the mobile charger act as two agents who cooperate to finish a task. The discrete-continuous hybrid action space of the two agents poses a significant challenge in our problem. To address this issue, we present a hybrid-action deep reinforcement learning framework, called HaDMC, which uses a standard policy learning algorithm to generate latent continuous actions. Motivated by representation learning, we specifically design and train an action decoder. It involves two pipelines to convert the latent continuous actions into original discrete and continuous actions, by which the drone and the charger can directly interact with environment. We embed a mutual learning scheme in model training, emphasizing the collaborative rather than individual actions. We conduct extensive numerical experiments to evaluate HaDMC and compare it with state-of-the-art deep reinforcement learning approaches. The experimental results show the effectiveness and efficiency of our solution.
- [201] arXiv:2403.10930 [ pdf , ps , html , other ]
-
Title: Inducing Individual Students' Learning Strategies through Homomorphic POMDPsComments: 11pages, 3figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Optimizing students' learning strategies is a crucial component in intelligent tutoring systems. Previous research has demonstrated the effectiveness of devising personalized learning strategies for students by modelling their learning processes through partially observable Markov decision process (POMDP). However, the research holds the assumption that the student population adheres to a uniform cognitive pattern. While this assumption simplifies the POMDP modelling process, it evidently deviates from a real-world scenario, thus reducing the precision of inducing individual students' learning strategies. In this article, we propose the homomorphic POMDP (H-POMDP) model to accommodate multiple cognitive patterns and present the parameter learning approach to automatically construct the H-POMDP model. Based on the H-POMDP model, we are able to represent different cognitive patterns from the data and induce more personalized learning strategies for individual students. We conduct experiments to show that, in comparison to the general POMDP approach, the H-POMDP model demonstrates better precision when modelling mixed data from multiple cognitive patterns. Moreover, the learning strategies derived from H-POMDPs exhibit better personalization in the performance evaluation.
- [202] arXiv:2403.11217 [ pdf , ps , other ]
-
Title: Research on Personal Credit Risk Assessment Methods Based on Causal InferenceSubjects: Artificial Intelligence (cs.AI) ; Category Theory (math.CT)
Abstract: The discussion on causality in human history dates back to ancient Greece, yet to this day, there is still no consensus. Fundamentally, this stems from the nature of human cognition, as understanding causality requires abstract tools to transcend the limitations of human cognition. In recent decades, the rapid development of mathematical and computational tools has provided new theoretical and technical means for exploring causality, creating more avenues for investigation.
Based on this, this paper introduces a new definition of causality using category theory, proposed by Samuel Eilenberg and Saunders Mac Lane in 1945 to avoid the self-referential contradictions in set theory, notably the Russell paradox. Within this framework, the feasibility of indicator synthesis in causal inference is demonstrated. Due to the limitations in the development of category theory-related technical tools, this paper adopts the widely-used probabilistic causal graph tool proposed by Judea Pearl in 1995 to study the application of causal inference in personal credit risk management. The specific work includes: research on the construction method of causal inference index system, definition of causality and feasibility proof of indicator synthesis causal inference within this framework, application methods of causal graph model and intervention alternative criteria in personal credit risk management, and so on. - [203] arXiv:2403.11219 [ pdf , ps , html , other ]
-
Title: Causality from Bottom to Top: A SurveySubjects: Artificial Intelligence (cs.AI)
Abstract: Causality has become a fundamental approach for explaining the relationships between events, phenomena, and outcomes in various fields of study. It has invaded various fields and applications, such as medicine, healthcare, economics, finance, fraud detection, cybersecurity, education, public policy, recommender systems, anomaly detection, robotics, control, sociology, marketing, and advertising. In this paper, we survey its development over the past five decades, shedding light on the differences between causality and other approaches, as well as the preconditions for using it. Furthermore, the paper illustrates how causality interacts with new approaches such as Artificial Intelligence (AI), Generative AI (GAI), Machine and Deep Learning, Reinforcement Learning (RL), and Fuzzy Logic. We study the impact of causality on various fields, its contribution, and its interaction with state-of-the-art approaches. Additionally, the paper exemplifies the trustworthiness and explainability of causality models. We offer several ways to evaluate causality models and discuss future directions.
- [204] arXiv:2403.11381 [ pdf , ps , html , other ]
-
Title: Can LLM-Augmented autonomous agents cooperate?, An evaluation of their cooperative capabilities through Melting PotManuel Mosquera , Juan Sebastian Pinzon , Manuel Rios , Yesid Fonseca , Luis Felipe Giraldo , Nicanor Quijano , Ruben ManriqueSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: As the field of AI continues to evolve, a significant dimension of this progression is the development of Large Language Models and their potential to enhance multi-agent artificial intelligence systems. This paper explores the cooperative capabilities of Large Language Model-augmented Autonomous Agents (LAAs) using the well-known Meltin Pot environments along with reference models such as GPT4 and GPT3.5. Preliminary results suggest that while these agents demonstrate a propensity for cooperation, they still struggle with effective collaboration in given environments, emphasizing the need for more robust architectures. The study's contributions include an abstraction layer to adapt Melting Pot game scenarios for LLMs, the implementation of a reusable architecture for LLM-mediated agent development - which includes short and long-term memories and different cognitive modules, and the evaluation of cooperation capabilities using a set of metrics tied to the Melting Pot's "Commons Harvest" game. The paper closes, by discussing the limitations of the current architectural framework and the potential of a new set of modules that fosters better cooperation among LAAs.
- [205] arXiv:2403.11642 [ pdf , ps , html , other ]
-
Title: Guiding the generation of counterfactual explanations through temporal background knowledge for Predictive Process MonitoringSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Counterfactual explanations suggest what should be different in the input instance to change the outcome of an AI system. When dealing with counterfactual explanations in the field of Predictive Process Monitoring, however, control flow relationships among events have to be carefully considered. A counterfactual, indeed, should not violate control flow relationships among activities (temporal background knowledege). Within the field of Explainability in Predictive Process Monitoring, there have been a series of works regarding counterfactual explanations for outcome-based predictions. However, none of them consider the inclusion of temporal background knowledge when generating these counterfactuals. In this work, we adapt state-of-the-art techniques for counterfactual generation in the domain of XAI that are based on genetic algorithms to consider a series of temporal constraints at runtime. We assume that this temporal background knowledge is given, and we adapt the fitness function, as well as the crossover and mutation operators, to maintain the satisfaction of the constraints. The proposed methods are evaluated with respect to state-of-the-art genetic algorithms for counterfactual generation and the results are presented. We showcase that the inclusion of temporal background knowledge allows the generation of counterfactuals more conformant to the temporal background knowledge, without however losing in terms of the counterfactual traditional quality metrics.
- [206] arXiv:2403.11734 [ pdf , ps , html , other ]
-
Title: Learning General Policies for Classical Planning Domains: Getting Beyond C$_2$Comments: Submitted to IJCAI 2024Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: GNN-based approaches for learning general policies across planning domains are limited by the expressive power of $C_2$, namely; first-order logic with two variables and counting. This limitation can be overcomed by transitioning to $k$-GNNs, for $k=3$, wherein object embeddings are substituted with triplet embeddings. Yet, while $3$-GNNs have the expressive power of $C_3$, unlike $1$- and $2$-GNNs that are confined to $C_2$, they require quartic time for message exchange and cubic space for embeddings, rendering them impractical. In this work, we introduce a parameterized version of relational GNNs. When $t$ is infinity, R-GNN[$t$] approximates $3$-GNNs using only quadratic space for embeddings. For lower values of $t$, such as $t=1$ and $t=2$, R-GNN[$t$] achieves a weaker approximation by exchanging fewer messages, yet interestingly, often yield the $C_3$ features required in several planning domains. Furthermore, the new R-GNN[$t$] architecture is the original R-GNN architecture with a suitable transformation applied to the input states only. Experimental results illustrate the clear performance gains of R-GNN[$1$] and R-GNN[$2$] over plain R-GNNs, and also over edge transformers that also approximate $3$-GNNs.
- [207] arXiv:2403.11807 [ pdf , ps , html , other ]
-
Title: How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent EnvironmentsJen-tse Huang , Eric John Li , Man Ho Lam , Tian Liang , Wenxuan Wang , Youliang Yuan , Wenxiang Jiao , Xing Wang , Zhaopeng Tu , Michael R. LyuComments: 16 pages of main text. 11 pages of appendices. 15 figures, 9 tables. Updated scoring schemeSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the participation of more than two agents simultaneously. Subsequently, we introduce our framework, GAMA-Bench, including eight classical multi-agent games. We design a scoring scheme to assess a model's performance in these games quantitatively. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of 60.5. Moreover, Gemini-1.0-Pro and GPT-3.5 (0613, 1106, 0125) demonstrate similar intelligence on GAMA-Bench. The code and experimental results are made publicly available via this https URL .
- [208] arXiv:2403.11905 [ pdf , ps , html , other ]
-
Title: Tur[k]ingBench: A Challenge Benchmark for Web AgentsKevin Xu , Yeganeh Kordi , Kate Sanders , Yizhong Wang , Adam Byerly , Jack Zhang , Benjamin Van Durme , Daniel KhashabiSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Abstract: Recent chatbots have demonstrated impressive ability to understand and communicate in raw-text form. However, there is more to the world than raw text. For example, humans spend long hours of their time on web pages, where text is intertwined with other modalities and tasks are accomplished in the form of various complex interactions. Can state-of-the-art multi-modal models generalize to such complex domains?
To address this question, we introduce TurkingBench, a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context. Unlike existing work which employs artificially synthesized web pages, here we use natural HTML pages that were originally designed for crowdsourcing workers for various annotation purposes. The HTML instructions of each task are also instantiated with various values (obtained from the crowdsourcing tasks) to form new instances of the task. This benchmark contains 32.2K instances distributed across 158 tasks.
Additionally, to facilitate the evaluation on TurkingBench, we develop an evaluation framework that connects the responses of chatbots to modifications on web pages (modifying a text box, checking a radio, etc.). We evaluate the performance of state-of-the-art models, including language-only, vision-only, and layout-only models, and their combinations, on this benchmark. Our findings reveal that these models perform significantly better than random chance, yet considerable room exists for improvement. We hope this benchmark will help facilitate the evaluation and development of web-based agents. - [209] arXiv:2403.12094 [ pdf , ps , html , other ]
-
Title: Are LLMs Good Cryptic Crossword Solvers?Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Cryptic crosswords are puzzles that rely not only on general knowledge but also on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models. However, the abilities of large language models (LLMs) have not yet been tested on this task. In this paper, we establish the benchmark results for three popular LLMs -- LLaMA2, Mistral, and ChatGPT -- showing that their performance on this task is still far from that of humans.
- [210] arXiv:2403.12106 [ pdf , ps , html , other ]
-
Title: Circular Belief Propagation for Approximate Probabilistic InferenceSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Belief Propagation (BP) is a simple probabilistic inference algorithm, consisting of passing messages between nodes of a graph representing a probability distribution. Its analogy with a neural network suggests that it could have far-ranging applications for neuroscience and artificial intelligence. Unfortunately, it is only exact when applied to cycle-free graphs, which restricts the potential of the algorithm. In this paper, we propose Circular Belief Propagation (CBP), an extension of BP which limits the detrimental effects of message reverberation caused by cycles by learning to detect and cancel spurious correlations and belief amplifications. We show in numerical experiments involving binary probabilistic graphs that CBP far outperforms BP and reaches good performance compared to that of previously proposed algorithms.
- [211] arXiv:2403.12108 [ pdf , ps , html , other ]
-
Title: Does AI help humans make better decisions? A methodological framework for experimental evaluationSubjects: Artificial Intelligence (cs.AI) ; General Economics (econ.GN); Applications (stat.AP); Methodology (stat.ME)
Abstract: The use of Artificial Intelligence (AI) based on data-driven algorithms has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions as compared to a human alone or AI an alone. We introduce a new methodological framework that can be used to answer experimentally this question with no additional assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded experimental design, in which the provision of AI-generated recommendations is randomized across cases with a human making final decisions. Under this experimental design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. We apply the proposed methodology to the data from our own randomized controlled trial of a pretrial risk assessment instrument. We find that AI recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Our analysis also shows that AI-alone decisions generally perform worse than human decisions with or without AI assistance. Finally, AI recommendations tend to impose cash bail on non-white arrestees more often than necessary when compared to white arrestees.
- [212] arXiv:2403.12151 [ pdf , ps , html , other ]
-
Title: Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State ClassificationFilippos Gouidis , Katerina Papantoniou , Konstantinos Papoutsakis Theodore Patkos , Antonis Argyros , Dimitris PlexousakisComments: Accepted at the AAAI-MAKE 24Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.
- [213] arXiv:2403.12153 [ pdf , ps , other ]
-
Title: Routing and Scheduling in Answer Set Programming applied to Multi-Agent Path Finding: Preliminary ReportSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
Abstract: We present alternative approaches to routing and scheduling in Answer Set Programming (ASP), and explore them in the context of Multi-agent Path Finding. The idea is to capture the flow of time in terms of partial orders rather than time steps attached to actions and fluents. This also abolishes the need for fixed upper bounds on the length of plans. The trade-off for this avoidance is that (parts of) temporal trajectories must be acyclic, since multiple occurrences of the same action or fluent cannot be distinguished anymore. While this approach provides an interesting alternative for modeling routing, it is without alternative for scheduling since fine-grained timings cannot be represented in ASP in a feasible way. This is different for partial orders that can be efficiently handled by external means such as acyclicity and difference constraints. We formally elaborate upon this idea and present several resulting ASP encodings. Finally, we demonstrate their effectiveness via an empirical analysis.
- [214] arXiv:2403.12162 [ pdf , ps , html , other ]
-
Title: Intelligent Execution through Plan AnalysisComments: Published at IROS 21, 6 pagesSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: Intelligent robots need to generate and execute plans. In order to deal with the complexity of real environments, planning makes some assumptions about the world. When executing plans, the assumptions are usually not met. Most works have focused on the negative impact of this fact and the use of replanning after execution failures. Instead, we focus on the positive impact, or opportunities to find better plans. When planning, the proposed technique finds and stores those opportunities. Later, during execution, the monitoring system can use them to focus perception and repair the plan, instead of replanning from scratch. Experiments in several paradigmatic robotic tasks show how the approach outperforms standard replanning strategies.
- [215] arXiv:2403.12201 [ pdf , ps , html , other ]
-
Title: Compositional learning of functions in humans and machinesComments: 7 pages, 6 figuresSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: The ability to learn and compose functions is foundational to efficient learning and reasoning in humans, enabling flexible generalizations such as creating new dishes from known cooking processes. Beyond sequential chaining of functions, existing linguistics literature indicates that humans can grasp more complex compositions with interacting functions, where output production depends on context changes induced by different function orderings. Extending the investigation into the visual domain, we developed a function learning paradigm to explore the capacity of humans and neural network models in learning and reasoning with compositional functions under varied interaction conditions. Following brief training on individual functions, human participants were assessed on composing two learned functions, in ways covering four main interaction types, including instances in which the application of the first function creates or removes the context for applying the second function. Our findings indicate that humans can make zero-shot generalizations on novel visual function compositions across interaction conditions, demonstrating sensitivity to contextual changes. A comparison with a neural network model on the same task reveals that, through the meta-learning for compositionality (MLC) approach, a standard sequence-to-sequence Transformer can mimic human generalization patterns in composing functions.
- [216] arXiv:2403.12308 [ pdf , ps , html , other ]
-
Title: Gradient-based Fuzzy System Optimisation via Automatic Differentiation -- FuzzyR as a Use CaseSubjects: Artificial Intelligence (cs.AI)
Abstract: Since their introduction, fuzzy sets and systems have become an important area of research known for its versatility in modelling, knowledge representation and reasoning, and increasingly its potential within the context explainable AI. While the applications of fuzzy systems are diverse, there has been comparatively little advancement in their design from a machine learning perspective. In other words, while representations such as neural networks have benefited from a boom in learning capability driven by an increase in computational performance in combination with advances in their training mechanisms and available tool, in particular gradient descent, the impact on fuzzy system design has been limited. In this paper, we discuss gradient-descent-based optimisation of fuzzy systems, focussing in particular on automatic differentiation -- crucial to neural network learning -- with a view to free fuzzy system designers from intricate derivative computations, allowing for more focus on the functional and explainability aspects of their design. As a starting point, we present a use case in FuzzyR which demonstrates how current fuzzy inference system implementations can be adjusted to leverage powerful features of automatic differentiation tools sets, discussing its potential for the future of fuzzy system design.
- [217] arXiv:2403.12406 [ pdf , ps , html , other ]
-
Title: Offline Imitation of Badminton Player Behavior via Experiential Contexts and Brownian MotionComments: PreprintSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: In the dynamic and rapid tactic involvements of turn-based sports, badminton stands out as an intrinsic paradigm that requires alter-dependent decision-making of players. While the advancement of learning from offline expert data in sequential decision-making has been witnessed in various domains, how to rally-wise imitate the behaviors of human players from offline badminton matches has remained underexplored. Replicating opponents' behavior benefits players by allowing them to undergo strategic development with direction before matches. However, directly applying existing methods suffers from the inherent hierarchy of the match and the compounding effect due to the turn-based nature of players alternatively taking actions. In this paper, we propose RallyNet, a novel hierarchical offline imitation learning model for badminton player behaviors: (i) RallyNet captures players' decision dependencies by modeling decision-making processes as a contextual Markov decision process. (ii) RallyNet leverages the experience to generate context as the agent's intent in the rally. (iii) To generate more realistic behavior, RallyNet leverages Geometric Brownian Motion (GBM) to model the interactions between players by introducing a valuable inductive bias for learning player behaviors. In this manner, RallyNet links player intents with interaction models with GBM, providing an understanding of interactions for sports analytics. We extensively validate RallyNet with the largest available real-world badminton dataset consisting of men's and women's singles, demonstrating its ability to imitate player behaviors. Results reveal RallyNet's superiority over offline imitation learning methods and state-of-the-art turn-based approaches, outperforming them by at least 16% in mean rule-based agent normalization score. Furthermore, we discuss various practical use cases to highlight RallyNet's applicability.
- [218] arXiv:2403.12417 [ pdf , ps , html , other ]
-
Title: On Predictive planning and counterfactual learning in active inferenceComments: 13 pages, 8 figuresSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: Given the rapid advancement of artificial intelligence, understanding the foundations of intelligent behaviour is increasingly important. Active inference, regarded as a general theory of behaviour, offers a principled approach to probing the basis of sophistication in planning and decision-making. In this paper, we examine two decision-making schemes in active inference based on 'planning' and 'learning from experience'. Furthermore, we also introduce a mixed model that navigates the data-complexity trade-off between these strategies, leveraging the strengths of both to facilitate balanced decision-making. We evaluate our proposed model in a challenging grid-world scenario that requires adaptability from the agent. Additionally, our model provides the opportunity to analyze the evolution of various parameters, offering valuable insights and contributing to an explainable framework for intelligent decision-making.
- [219] arXiv:2403.12451 [ pdf , ps , html , other ]
-
Title: INSIGHT: End-to-End Neuro-Symbolic Visual Reinforcement Learning with Language ExplanationsSubjects: Artificial Intelligence (cs.AI)
Abstract: Neuro-symbolic reinforcement learning (NS-RL) has emerged as a promising paradigm for explainable decision-making, characterized by the interpretability of symbolic policies. For tasks with visual observations, NS-RL entails structured representations for states, but previous algorithms are unable to refine the structured states with reward signals due to a lack of efficiency. Accessibility is also an issue, as extensive domain knowledge is required to interpret current symbolic policies. In this paper, we present a framework that is capable of learning structured states and symbolic policies simultaneously, whose key idea is to overcome the efficiency bottleneck by distilling vision foundation models into a scalable perception module. Moreover, we design a pipeline that uses large language models to generate concise and readable language explanations for policies and decisions. In experiments on nine Atari tasks, our approach demonstrates substantial performance gains over existing NSRL methods. We also showcase explanations for policies and decisions.
- [220] arXiv:2403.12482 [ pdf , ps , html , other ]
-
Title: Embodied LLM Agents Learn to Cooperate in Organized TeamsXudong Guo , Kaixuan Huang , Jiale Liu , Wenhui Fan , Natalia Vélez , Qingyun Wu , Huazheng Wang , Thomas L. Griffiths , Mengdi WangSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Abstract: Large Language Models (LLMs) have emerged as integral tools for reasoning, planning, and decision-making, drawing upon their extensive world knowledge and proficiency in language-related tasks. LLMs thus hold tremendous potential for natural language interaction within multi-agent systems to foster cooperation. However, LLM agents tend to over-report and comply with any instruction, which may result in information redundancy and confusion in multi-agent cooperation. Inspired by human organizations, this paper introduces a framework that imposes prompt-based organization structures on LLM agents to mitigate these problems. Through a series of experiments with embodied LLM agents and human-agent collaboration, our results highlight the impact of designated leadership on team efficiency, shedding light on the leadership qualities displayed by LLM agents and their spontaneous cooperative behaviors. Further, we harness the potential of LLMs to propose enhanced organizational prompts, via a Criticize-Reflect process, resulting in novel organization structures that reduce communication costs and enhance team efficiency.
- [221] arXiv:2403.12627 [ pdf , ps , html , other ]
-
Title: Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq CodeComments: 11 pagesSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: In the realm of formal theorem proving, the Coq proof assistant stands out for its rigorous approach to verifying mathematical assertions and software correctness. Despite the advances in artificial intelligence and machine learning, the specialized nature of Coq syntax and semantics poses unique challenges for Large Language Models (LLMs). Addressing this gap, we present a comprehensive dataset specifically designed to enhance LLMs' proficiency in interpreting and generating Coq code. This dataset, derived from a collection of over 10,000 Coq source files, encompasses a wide array of propositions, proofs, and definitions, enriched with metadata including source references and licensing information. Our primary aim is to facilitate the development of LLMs capable of generating syntactically correct and semantically meaningful Coq constructs, thereby advancing the frontier of automated theorem proving. Initial experiments with this dataset have showcased its significant potential; models trained on this data exhibited enhanced accuracy in Coq code generation. Notably, a particular experiment revealed that a fine-tuned LLM was capable of generating 141 valid proofs for a basic lemma, highlighting the dataset's utility in facilitating the discovery of diverse and valid proof strategies. This paper discusses the dataset's composition, the methodology behind its creation, and the implications of our findings for the future of machine learning in formal verification. The dataset is accessible for further research and exploration: this https URL
- [222] arXiv:2403.12805 [ pdf , ps , html , other ]
-
Title: Contextual Moral Value Alignment Through Context-Based AggregationPierre Dognin , Jesus Rios , Ronny Luss , Inkit Padhi , Matthew D Riemer , Miao Liu , Prasanna Sattigeri , Manish Nagireddy , Kush R. Varshney , Djallel BouneffoufSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Developing value-aligned AI agents is a complex undertaking and an ongoing challenge in the field of AI. Specifically within the domain of Large Language Models (LLMs), the capability to consolidate multiple independently trained dialogue agents, each aligned with a distinct moral value, into a unified system that can adapt to and be aligned with multiple moral values is of paramount importance. In this paper, we propose a system that does contextual moral value alignment based on contextual aggregation. Here, aggregation is defined as the process of integrating a subset of LLM responses that are best suited to respond to a user input, taking into account features extracted from the user's input. The proposed system shows better results in term of alignment to human value compared to the state of the art.
- [223] arXiv:2403.12869 [ pdf , ps , other ]
-
Title: Regularization in Spider-Style Strategy Discovery and Schedule ConstructionComments: 25 pages, 8 figures, submitted to IJCAR 2024Subjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
Abstract: To achieve the best performance, automatic theorem provers often rely on schedules of diverse proving strategies to be tried out (either sequentially or in parallel) on a given problem. In this paper, we report on a large-scale experiment with discovering strategies for the Vampire prover, targeting the FOF fragment of the TPTP library and constructing a schedule for it, based on the ideas of Andrei Voronkov's system Spider. We examine the process from various angles, discuss the difficulty (or ease) of obtaining a strong Vampire schedule for the CASC competition, and establish how well a schedule can be expected to generalize to unseen problems and what factors influence this property.
- [224] arXiv:2403.13311 [ pdf , ps , html , other ]
-
Title: Multi-Robot Connected Fermat Spiral CoverageComments: accepted to ICAPS24Subjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract: We introduce the Multi-Robot Connected Fermat Spiral (MCFS), a novel algorithmic framework for Multi-Robot Coverage Path Planning (MCPP) that adapts Connected Fermat Spiral (CFS) from the computer graphics community to multi-robot coordination for the first time. MCFS uniquely enables the orchestration of multiple robots to generate coverage paths that contour around arbitrarily shaped obstacles, a feature that is notably lacking in traditional methods. Our framework not only enhances area coverage and optimizes task performance, particularly in terms of makespan, for workspaces rich in irregular obstacles but also addresses the challenges of path continuity and curvature critical for non-holonomic robots by generating smooth paths without decomposing the workspace. MCFS solves MCPP by constructing a graph of isolines and transforming MCPP into a combinatorial optimization problem, aiming to minimize the makespan while covering all vertices. Our contributions include developing a unified CFS version for scalable and adaptable MCPP, extending it to MCPP with novel optimization techniques for cost reduction and path continuity and smoothness, and demonstrating through extensive experiments that MCFS outperforms existing MCPP methods in makespan, path curvature, coverage ratio, and overlapping ratio. Our research marks a significant step in MCPP, showcasing the fusion of computer graphics and automated planning principles to advance the capabilities of multi-robot systems in complex environments. Our code is available at this https URL .
- [225] arXiv:2403.13313 [ pdf , ps , html , other ]
-
Title: Polaris: A Safety-focused LLM Constellation Architecture for HealthcareSubhabrata Mukherjee , Paul Gamble , Markel Sanz Ausin , Neel Kant , Kriti Aggarwal , Neha Manjunath , Debajyoti Datta , Zhengliang Liu , Jiayuan Ding , Sophia Busacca , Cezanne Bianco , Swapnil Sharma , Rae Lasko , Michelle Voisard , Sanchay Harneja , Darya Filippova , Gerry Meixiong , Kevin Cha , Amir Youssefi , Meyhaa Buvanesh , Howard Weingram , Sebastian Bierman-Lytle , Harpreet Singh Mangat , Kim Parikh , Saad Godil , Alex MillerSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: We develop Polaris, the first safety-focused LLM constellation for real-time patient-AI healthcare conversations. Unlike prior LLM works in healthcare focusing on tasks like question answering, our work specifically focuses on long multi-turn voice conversations. Our one-trillion parameter constellation system is composed of several multibillion parameter LLMs as co-operative agents: a stateful primary agent that focuses on driving an engaging conversation and several specialist support agents focused on healthcare tasks performed by nurses to increase safety and reduce hallucinations. We develop a sophisticated training protocol for iterative co-training of the agents that optimize for diverse objectives. We train our models on proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents. We align our models to speak like medical professionals, using organic healthcare conversations and simulated ones between patient actors and experienced nurses. This allows our system to express unique capabilities such as rapport building, trust building, empathy and bedside manner. Finally, we present the first comprehensive clinician evaluation of an LLM system for healthcare. We recruited over 1100 U.S. licensed nurses and over 130 U.S. licensed physicians to perform end-to-end conversational evaluations of our system by posing as patients and rating the system on several measures. We demonstrate Polaris performs on par with human nurses on aggregate across dimensions such as medical safety, clinical readiness, conversational quality, and bedside manner. Additionally, we conduct a challenging task-based evaluation of the individual specialist support agents, where we demonstrate our LLM agents significantly outperform a much larger general-purpose LLM (GPT-4) as well as from its own medium-size class (LLaMA-2 70B).
- [226] arXiv:2403.13433 [ pdf , ps , html , other ]
-
Title: AgentGroupChat: An Interactive Group Chat Simulacra For Better Eliciting Emergent BehaviorZhouhong Gu , Xiaoxuan Zhu , Haoran Guo , Lin Zhang , Yin Cai , Hao Shen , Jiangjie Chen , Zheyu Ye , Yifei Dai , Yan Gao , Yao Hu , Hongwei Feng , Yanghua XiaoSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: Language significantly influences the formation and evolution of Human emergent behavior, which is crucial in understanding collective intelligence within human societies. Considering that the study of how language affects human behavior needs to put it into the dynamic scenarios in which it is used, we introduce AgentGroupChat in this paper, a simulation that delves into the complex role of language in shaping collective behavior through interactive debate scenarios. Central to this simulation are characters engaging in dynamic conversation interactions. To enable simulation, we introduce the Verbal Strategist Agent, utilizing large language models to enhance interaction strategies by incorporating elements of persona and action. We set four narrative scenarios based on AgentGroupChat to demonstrate the simulation's capacity to mimic complex language use in group dynamics. Evaluations focus on aligning agent behaviors with human expectations and the emergence of collective behaviors within the simulation. Results reveal that emergent behaviors materialize from a confluence of factors: a conducive environment for extensive information exchange, characters with diverse traits, high linguistic comprehension, and strategic adaptability. During discussions on ``the impact of AI on humanity'' in AgentGroupChat simulation, philosophers commonly agreed that ``AI could enhance societal welfare with judicious limitations'' and even come to a conclusion that ``the essence of true intelligence encompasses understanding the necessity to constrain self abilities''. Additionally, in the competitive domain of casting for primary roles in films in AgentGroupChat, certain actors were ready to reduce their remuneration or accept lesser roles, motivated by their deep-seated desire to contribute to the project.
- [227] arXiv:2403.13441 [ pdf , ps , html , other ]
-
Title: Robustness Verifcation in Neural NetworksComments: 16 pages, 1 figureSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: In this paper we investigate formal verification problems for Neural Network computations. Of central importance will be various robustness and minimization problems such as: Given symbolic specifications of allowed inputs and outputs in form of Linear Programming instances, one question is whether there do exist valid inputs such that the network computes a valid output? And does this property hold for all valid inputs? Do two given networks compute the same function? Is there a smaller network computing the same function?
The complexity of these questions have been investigated recently from a practical point of view and approximated by heuristic algorithms. We complement these achievements by giving a theoretical framework that enables us to interchange security and efficiency questions in neural networks and analyze their computational complexities. We show that the problems are conquerable in a semi-linear setting, meaning that for piecewise linear activation functions and when the sum- or maximum metric is used, most of them are in P or in NP at most. - [228] arXiv:2403.13447 [ pdf , ps , html , other ]
-
Title: HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language ModelsWenqiao Zhang , Tianwei Lin , Jiang Liu , Fangxun Shu , Haoyuan Li , Lei Zhang , He Wanggui , Hao Zhou , Zheqi Lv , Hao Jiang , Juncheng Li , Siliang Tang , Yueting ZhuangSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling \emph{static} LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the \emph{static} tuning strategy~\footnote{The static tuning refers to the trained model with static parameters.} that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generates adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training.
Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. ~\footnote{Our project is available on the link this https URL }. - [229] arXiv:2403.13518 [ pdf , ps , html , other ]
-
Title: Motion Generation from Fine-grained Textual DescriptionsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract: The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at this https URL .
- [230] arXiv:2403.13705 [ pdf , ps , html , other ]
-
Title: Research Re: search & Re-searchComments: PhD thesis Aske Plaat 20 June 1996. AlphaBeta, SSS*, MTD(f)Subjects: Artificial Intelligence (cs.AI)
Abstract: Search algorithms are often categorized by their node expansion strategy. One option is the depth-first strategy, a simple backtracking strategy that traverses the search space in the order in which successor nodes are generated. An alternative is the best-first strategy, which was designed to make it possible to use domain-specific heuristic information. By exploring promising parts of the search space first, best-first algorithms are usually more efficient than depth-first algorithms.
In programs that play minimax games such as chess and checkers, the efficiency of the search is of crucial importance. Given the success of best-first algorithms in other domains, one would expect them to be used for minimax games too. However, all high-performance game-playing programs are based on a depth-first algorithm.
This study takes a closer look at a depth-first algorithm, AB, and a best-first algorithm, SSS. The prevailing opinion on these algorithms is that SSS offers the potential for a more efficient search, but that its complicated formulation and exponential memory requirements render it impractical. The theoretical part of this work shows that there is a surprisingly straightforward link between the two algorithms -- for all practical purposes, SSS is a special case of AB. Subsequent empirical evidence proves the prevailing opinion on SSS to be wrong: it is not a complicated algorithm, it does not need too much memory, and it is also not more efficient than depth-first search. - [231] arXiv:2403.14077 [ pdf , ps , html , other ]
-
Title: Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media ForensicsShan Jia , Reilin Lyu , Kangran Zhao , Yize Chen , Zhiyuan Yan , Yan Ju , Chuanbo Hu , Xin Li , Baoyuan Wu , Siwei LyuSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR)
Abstract: DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation. Detecting DeepFakes is currently solved with programmed machine learning algorithms. In this work, we investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection. We conducted qualitative and quantitative experiments to demonstrate multimodal LLMs and show that they can expose AI-generated images through careful experimental design and prompt engineering. This is interesting, considering that LLMs are not inherently tailored for media forensic tasks, and the process does not require programming. We discuss the limitations of multimodal LLMs for these tasks and suggest possible improvements.
- [232] arXiv:2403.14100 [ pdf , ps , html , other ]
-
Title: Causal knowledge engineering: A case study from COVID-19Steven Mascaro , Yue Wu , Ross Pearson , Owen Woodberry , Jessica Ramsay , Tom Snelling , Ann E. NicholsonComments: 22 pages (plus 19 pages in appendices), 9 figures, submitted for reviewSubjects: Artificial Intelligence (cs.AI)
Abstract: COVID-19 appeared abruptly in early 2020, requiring a rapid response amid a context of great uncertainty. Good quality data and knowledge was initially lacking, and many early models had to be developed with causal assumptions and estimations built in to supplement limited data, often with no reliable approach for identifying, validating and documenting these causal assumptions. Our team embarked on a knowledge engineering process to develop a causal knowledge base consisting of several causal BNs for diverse aspects of COVID-19. The unique challenges of the setting lead to experiments with the elicitation approach, and what emerged was a knowledge engineering method we call Causal Knowledge Engineering (CKE). The CKE provides a structured approach for building a causal knowledge base that can support the development of a variety of application-specific models. Here we describe the CKE method, and use our COVID-19 work as a case study to provide a detailed discussion and analysis of the method.
- [233] arXiv:2403.14102 [ pdf , ps , html , other ]
-
Title: DouRN: Improving DouZero by Residual Neural NetworksJournal-ref: CyberC 2023: 96-99Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Deep reinforcement learning has made significant progress in games with imperfect information, but its performance in the card game Doudizhu (Chinese Poker/Fight the Landlord) remains unsatisfactory. Doudizhu is different from conventional games as it involves three players and combines elements of cooperation and confrontation, resulting in a large state and action space. In 2021, a Doudizhu program called DouZero\cite{zha2021douzero} surpassed previous models without prior knowledge by utilizing traditional Monte Carlo methods and multilayer perceptrons. Building on this work, our study incorporates residual networks into the model, explores different architectural designs, and conducts multi-role testing. Our findings demonstrate that this model significantly improves the winning rate within the same training time. Additionally, we introduce a call scoring system to assist the agent in deciding whether to become a landlord. With these enhancements, our model consistently outperforms the existing version of DouZero and even experienced human players. \footnote{The source code is available at \url{ this https URL .}
- [234] arXiv:2403.14443 [ pdf , ps , html , other ]
-
Title: Language Models Can Reduce Asymmetry in Information MarketsNasim Rahaman , Martin Weiss , Manuel Wüthrich , Yoshua Bengio , Li Erran Li , Chris Pal , Bernhard SchölkopfSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Abstract: This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.
- [235] arXiv:2403.14566 [ pdf , ps , html , other ]
-
Title: A survey on Concept-based Approaches For Model ImprovementSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: The focus of recent research has shifted from merely improving the metrics based performance of Deep Neural Networks (DNNs) to DNNs which are more interpretable to humans. The field of eXplainable Artificial Intelligence (XAI) has observed various techniques, including saliency-based and concept-based approaches. These approaches explain the model's decisions in simple human understandable terms called Concepts. Concepts are known to be the thinking ground of humans}. Explanations in terms of concepts enable detecting spurious correlations, inherent biases, or clever-hans. With the advent of concept-based explanations, a range of concept representation methods and automatic concept discovery algorithms have been introduced. Some recent works also use concepts for model improvement in terms of interpretability and generalization. We provide a systematic review and taxonomy of various concept representations and their discovery algorithms in DNNs, specifically in vision. We also provide details on concept-based model improvement literature marking the first comprehensive survey of these methods.
- [236] arXiv:2403.14589 [ pdf , ps , html , other ]
-
Title: ReAct Meets ActRe: When Language Agents Enjoy Training Data AutonomySubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Language agents have demonstrated autonomous decision-making abilities by reasoning with foundation models. Recently, efforts have been made to train language agents for performance improvement, with multi-step reasoning and action trajectories as the training data. However, collecting such trajectories still requires considerable human effort, by either artificial annotation or implementations of diverse prompting frameworks. In this work, we propose A$^3$T, a framework that enables the Autonomous Annotation of Agent Trajectories in the style of ReAct. The central role is an ActRe prompting agent, which explains the reason for an arbitrary action. When randomly sampling an external action, the ReAct-style agent could query the ActRe agent with the action to obtain its textual rationales. Novel trajectories are then synthesized by prepending the posterior reasoning from ActRe to the sampled action. In this way, the ReAct-style agent executes multiple trajectories for the failed tasks, and selects the successful ones to supplement its failed trajectory for contrastive self-training. Realized by policy gradient methods with binarized rewards, the contrastive self-training with accumulated trajectories facilitates a closed loop for multiple rounds of language agent self-improvement. We conduct experiments using QLoRA fine-tuning with the open-sourced Mistral-7B-Instruct-v0.2. In AlfWorld, the agent trained with A$^3$T obtains a 1-shot success rate of 96%, and 100% success with 4 iterative rounds. In WebShop, the 1-shot performance of the A$^3$T agent matches human average, and 4 rounds of iterative refinement lead to the performance approaching human experts. A$^3$T agents significantly outperform existing techniques, including prompting with GPT-4, advanced agent frameworks, and fully fine-tuned LLMs.
- [237] arXiv:2403.14705 [ pdf , ps , html , other ]
-
Title: Concept-Best-Matching: Evaluating Compositionality in Emergent CommunicationSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Artificial agents that learn to communicate in order to accomplish a given task acquire communication protocols that are typically opaque to a human. A large body of work has attempted to evaluate the emergent communication via various evaluation measures, with \emph{compositionality} featuring as a prominent desired trait. However, current evaluation procedures do not directly expose the compositionality of the emergent communication. We propose a procedure to assess the compositionality of emergent communication by finding the best-match between emerged words and natural language concepts. The best-match algorithm provides both a global score and a translation-map from emergent words to natural language concepts. To the best of our knowledge, it is the first time that such direct and interpretable mapping between emergent words and human concepts is provided.
- [238] arXiv:2403.14733 [ pdf , ps , other ]
-
Title: Open Knowledge Base Canonicalization with Multi-task LearningComments: arXiv admin note: substantial text overlap with arXiv:2310.16419Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The construction of large open knowledge bases (OKBs) is integral to many knowledge-driven applications on the world wide web such as web search. However, noun phrases and relational phrases in OKBs often suffer from redundancy and ambiguity, which calls for the investigation on OKB canonicalization. Current solutions address OKB canonicalization by devising advanced clustering algorithms and using knowledge graph embedding (KGE) to further facilitate the canonicalization process. Nevertheless, these works fail to fully exploit the synergy between clustering and KGE learning, and the methods designed for these subtasks are sub-optimal. To this end, we put forward a multi-task learning framework, namely MulCanon, to tackle OKB canonicalization. In addition, diffusion model is used in the soft clustering process to improve the noun phrase representations with neighboring information, which can lead to more accurate representations. MulCanon unifies the learning objectives of these sub-tasks, and adopts a two-stage multi-task learning paradigm for training. A thorough experimental study on popular OKB canonicalization benchmarks validates that MulCanon can achieve competitive canonicalization results.
- [239] arXiv:2403.14796 [ pdf , ps , html , other ]
-
Title: Planning and Acting While the Clock TicksAndrew Coles , Erez Karpas , Andrey Lavrinenko , Wheeler Ruml , Solomon Eyal Shimony , Shahaf ShperbergSubjects: Artificial Intelligence (cs.AI)
Abstract: Standard temporal planning assumes that planning takes place offline and then execution starts at time 0. Recently, situated temporal planning was introduced, where planning starts at time 0 and execution occurs after planning terminates. Situated temporal planning reflects a more realistic scenario where time passes during planning. However, in situated temporal planning a complete plan must be generated before any action is executed. In some problems with time pressure, timing is too tight to complete planning before the first action must be executed. For example, an autonomous car that has a truck backing towards it should probably move out of the way now and plan how to get to its destination later. In this paper, we propose a new problem setting: concurrent planning and execution, in which actions can be dispatched (executed) before planning terminates. Unlike previous work on planning and execution, we must handle wall clock deadlines that affect action applicability and goal achievement (as in situated planning) while also supporting dispatching actions before a complete plan has been found. We extend previous work on metareasoning for situated temporal planning to develop an algorithm for this new setting. Our empirical evaluation shows that when there is strong time pressure, our approach outperforms situated temporal planning.
- [240] arXiv:2403.14885 [ pdf , ps , html , other ]
-
Title: Establishing a leader in a pairwise comparisons methodComments: 9 figures, 19 pagesSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR); Computers and Society (cs.CY); Discrete Mathematics (cs.DM)
Abstract: Abstract Like electoral systems, decision-making methods are also vulnerable to manipulation by decision-makers. The ability to effectively defend against such threats can only come from thoroughly understanding the manipulation mechanisms. In the presented article, we show two algorithms that can be used to launch a manipulation attack. They allow for equating the weights of two selected alternatives in the pairwise comparison method and, consequently, choosing a leader. The theoretical considerations are accompanied by a Monte Carlo simulation showing the relationship between the size of the PC matrix, the degree of inconsistency, and the ease of manipulation. This work is a continuation of our previous research published in the paper (Szybowski et al., 2023)
- [241] arXiv:2403.14972 [ pdf , ps , html , other ]
-
Title: A Picture Is Worth a Graph: Blueprint Debate on Graph for Multimodal ReasoningComments: Work in progressSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Abstract: This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate BDoG, achieving state-of-the-art results in Science QA and MMBench with significant improvements over previous methods.
- [242] arXiv:2403.15137 [ pdf , ps , html , other ]
-
Title: CACA Agent: Capability Collaboration based AI AgentComments: 4 pages,5 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Abstract: As AI Agents based on Large Language Models (LLMs) have shown potential in practical applications across various fields, how to quickly deploy an AI agent and how to conveniently expand the application scenario of AI agents has become a challenge. Previous studies mainly focused on implementing all the reasoning capabilities of AI agents within a single LLM, which often makes the model more complex and also reduces the extensibility of AI agent functionality. In this paper, we propose CACA Agent (Capability Collaboration based AI Agent), using an open architecture inspired by service computing. CACA Agent integrates a set of collaborative capabilities to implement AI Agents, not only reducing the dependence on a single LLM, but also enhancing the extensibility of both the planning abilities and the tools available to AI agents. Utilizing the proposed system, we present a demo to illustrate the operation and the application scenario extension of CACA Agent.
- [243] arXiv:2403.15251 [ pdf , ps , html , other ]
-
Title: Safe Learning of PDDL Domains with Conditional Effects -- Extended VersionSubjects: Artificial Intelligence (cs.AI)
Abstract: Powerful domain-independent planners have been developed to solve various types of planning problems. These planners often require a model of the acting agent's actions, given in some planning domain description language. Manually designing such an action model is a notoriously challenging task. An alternative is to automatically learn action models from observation. Such an action model is called safe if every plan created with it is consistent with the real, unknown action model. Algorithms for learning such safe action models exist, yet they cannot handle domains with conditional or universal effects, which are common constructs in many planning problems. We prove that learning non-trivial safe action models with conditional effects may require an exponential number of samples. Then, we identify reasonable assumptions under which such learning is tractable and propose SAM Learning of Conditional Effects (Conditional-SAM), the first algorithm capable of doing so. We analyze Conditional-SAM theoretically and evaluate it experimentally. Our results show that the action models learned by Conditional-SAM can be used to solve perfectly most of the test set problems in most of the experimented domains.
- [244] arXiv:2403.15297 [ pdf , ps , html , other ]
-
Title: Sphere Neural-Networks for Rational ReasoningSubjects: Artificial Intelligence (cs.AI)
Abstract: The success of Large Language Models (LLMs), e.g., ChatGPT, is witnessed by their planetary popularity, their capability of human-like question-answering, and also by their steadily improved reasoning performance. However, it remains unclear whether LLMs reason. It is an open problem how traditional neural networks can be qualitatively extended to go beyond the statistic paradigm and achieve high-level cognition. Here, we present a minimalist qualitative extension by generalising computational building blocks from vectors to spheres. We propose Sphere Neural Networks (SphNNs) for human-like reasoning through model construction and inspection, and develop SphNN for syllogistic reasoning, a microcosm of human rationality. Instead of training data, SphNN uses a neuro-symbolic transition map of neighbourhood spatial relations to guide transformations from the current sphere configuration towards the target. SphNN is the first neural model that can determine the validity of long-chained syllogistic reasoning in one epoch by constructing sphere configurations as Euler diagrams, with the worst computational complexity of O(N^2). SphNN can evolve into various types of reasoning, such as spatio-temporal reasoning, logical reasoning with negation and disjunction, event reasoning, neuro-symbolic reasoning, and humour understanding (the highest level of cognition). All these suggest a new kind of Herbert A. Simon's scissors with two neural blades. SphNNs will tremendously enhance interdisciplinary collaborations to develop the two neural blades and realise deterministic neural reasoning and human-bounded rationality and elevate LLMs to reliable psychological AI. This work suggests that the non-zero radii of spheres are the missing components that prevent traditional deep-learning systems from reaching the realm of rational reasoning and cause LLMs to be trapped in the swamp of hallucination.
- [245] arXiv:2403.15341 [ pdf , ps , html , other ]
-
Title: Collaborative AI Teaming in Unknown Environments via Active Goal DeductionSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: With the advancements of artificial intelligence (AI), we're seeing more scenarios that require AI to work closely with other agents, whose goals and strategies might not be known beforehand. However, existing approaches for training collaborative agents often require defined and known reward signals and cannot address the problem of teaming with unknown agents that often have latent objectives/rewards. In response to this challenge, we propose teaming with unknown agents framework, which leverages kernel density Bayesian inverse learning method for active goal deduction and utilizes pre-trained, goal-conditioned policies to enable zero-shot policy adaptation. We prove that unbiased reward estimates in our framework are sufficient for optimal teaming with unknown agents. We further evaluate the framework of redesigned multi-agent particle and StarCraft II micromanagement environments with diverse unknown agents of different behaviors/rewards. Empirical results demonstrate that our framework significantly advances the teaming performance of AI and unknown agents in a wide range of collaborative scenarios.
- [246] arXiv:2403.15437 [ pdf , ps , html , other ]
-
Title: Apriori Knowledge in an Era of Computational Opacity: The Role of AI in Mathematical DiscoverySubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); History and Overview (math.HO)
Abstract: Computation is central to contemporary mathematics. Many accept that we can acquire genuine mathematical knowledge of the Four Color Theorem from Appel and Haken's program insofar as it is simply a repetitive application of human forms of mathematical reasoning. Modern LLMs / DNNs are, by contrast, opaque to us in significant ways, and this creates obstacles in obtaining mathematical knowledge from them. We argue, however, that if a proof-checker automating human forms of proof-checking is attached to such machines, then we can obtain apriori mathematical knowledge from them, even though the original machines are entirely opaque to us and the proofs they output are not human-surveyable.
- [247] arXiv:2403.15456 [ pdf , ps , html , other ]
-
Title: WoLF: Wide-scope Large Language Model Framework for CXR UnderstandingComments: 11 pages main paper, 2 pages supplementarySubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Significant methodological strides have been made toward Chest X-ray (CXR) understanding via modern vision-language models (VLMs), demonstrating impressive Visual Question Answering (VQA) and CXR report generation abilities. However, existing CXR understanding frameworks still possess several procedural caveats. (1) Previous methods solely use CXR reports, which are insufficient for comprehensive Visual Question Answering (VQA), especially when additional health-related data like medication history and prior diagnoses are needed. (2) Previous methods use raw CXR reports, which are often arbitrarily structured. While modern language models can understand various text formats, restructuring reports for clearer, organized anatomy-based information could enhance their usefulness. (3) Current evaluation methods for CXR-VQA primarily emphasize linguistic correctness, lacking the capability to offer nuanced assessments of the generated answers. In this work, to address the aforementioned caveats, we introduce WoLF, a Wide-scope Large Language Model Framework for CXR understanding. To resolve (1), we capture multi-faceted records of patients, which are utilized for accurate diagnoses in real-world clinical scenarios. Specifically, we adopt the Electronic Health Records (EHR) to generate instruction-following data suited for CXR understanding. Regarding (2), we enhance report generation performance by decoupling knowledge in CXR reports based on anatomical structure even within the attention step via masked attention. To address (3), we introduce an AI-evaluation protocol optimized for assessing the capabilities of LLM. Through extensive experimental validations, WoLF demonstrates superior performance over other models on MIMIC-CXR in the AI-evaluation arena about VQA (up to +9.47%p mean score) and by metrics about report generation (+7.3%p BLEU-1).
- [248] arXiv:2403.15504 [ pdf , ps , html , other ]
-
Title: SymboSLAM: Semantic Map Generation in a Multi-Agent SystemComments: 14 pages, 11 figuresSubjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA)
Abstract: Sub-symbolic artificial intelligence methods dominate the fields of environment-type classification and Simultaneous Localisation and Mapping. However, a significant area overlooked within these fields is solution transparency for the human-machine interaction space, as the sub-symbolic methods employed for map generation do not account for the explainability of the solutions generated. This paper proposes a novel approach to environment-type classification through Symbolic Simultaneous Localisation and Mapping, SymboSLAM, to bridge the explainability gap. Our method for environment-type classification observes ontological reasoning used to synthesise the context of an environment through the features found within. We achieve explainability within the model by presenting operators with environment-type classifications overlayed by a semantically labelled occupancy map of landmarks and features. We evaluate SymboSLAM with ground-truth maps of the Canberra region, demonstrating method effectiveness. We assessed the system through both simulations and real-world trials.
- [249] arXiv:2403.15574 [ pdf , ps , html , other ]
-
Title: SensoryT5: Infusing Sensorimotor Norms into T5 for Enhanced Fine-grained Emotion ClassificationComments: Accepted by CogALex 2024 conferenceSubjects: Artificial Intelligence (cs.AI)
Abstract: In traditional research approaches, sensory perception and emotion classification have traditionally been considered separate domains. Yet, the significant influence of sensory experiences on emotional responses is undeniable. The natural language processing (NLP) community has often missed the opportunity to merge sensory knowledge with emotion classification. To address this gap, we propose SensoryT5, a neuro-cognitive approach that integrates sensory information into the T5 (Text-to-Text Transfer Transformer) model, designed specifically for fine-grained emotion classification. This methodology incorporates sensory cues into the T5's attention mechanism, enabling a harmonious balance between contextual understanding and sensory awareness. The resulting model amplifies the richness of emotional representations. In rigorous tests across various detailed emotion classification datasets, SensoryT5 showcases improved performance, surpassing both the foundational T5 model and current state-of-the-art works. Notably, SensoryT5's success signifies a pivotal change in the NLP domain, highlighting the potential influence of neuro-cognitive data in refining machine learning models' emotional sensitivity.
- [250] arXiv:2403.15577 [ pdf , ps , html , other ]
-
Title: Autonomous Driving With Perception Uncertainties: Deep-Ensemble Based Adaptive Cruise ControlSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO); Systems and Control (eess.SY)
Abstract: Autonomous driving depends on perception systems to understand the environment and to inform downstream decision-making. While advanced perception systems utilizing black-box Deep Neural Networks (DNNs) demonstrate human-like comprehension, their unpredictable behavior and lack of interpretability may hinder their deployment in safety critical scenarios. In this paper, we develop an Ensemble of DNN regressors (Deep Ensemble) that generates predictions with quantification of prediction uncertainties. In the scenario of Adaptive Cruise Control (ACC), we employ the Deep Ensemble to estimate distance headway to the lead vehicle from RGB images and enable the downstream controller to account for the estimation uncertainty. We develop an adaptive cruise controller that utilizes Stochastic Model Predictive Control (MPC) with chance constraints to provide a probabilistic safety guarantee. We evaluate our ACC algorithm using a high-fidelity traffic simulator and a real-world traffic dataset and demonstrate the ability of the proposed approach to effect speed tracking and car following while maintaining a safe distance headway. The out-of-distribution scenarios are also examined.
- [251] arXiv:2403.15586 [ pdf , ps , html , other ]
-
Title: Generative AI in Education: A Study of Educators' Awareness, Sentiments, and Influencing FactorsSubjects: Artificial Intelligence (cs.AI)
Abstract: The rapid advancement of artificial intelligence (AI) and the expanding integration of large language models (LLMs) have ignited a debate about their application in education. This study delves into university instructors' experiences and attitudes toward AI language models, filling a gap in the literature by analyzing educators' perspectives on AI's role in the classroom and its potential impacts on teaching and learning. The objective of this research is to investigate the level of awareness, overall sentiment towardsadoption, and the factors influencing these attitudes for LLMs and generative AI-based tools in higher education. Data was collected through a survey using a Likert scale, which was complemented by follow-up interviews to gain a more nuanced understanding of the instructors' viewpoints. The collected data was processed using statistical and thematic analysis techniques. Our findings reveal that educators are increasingly aware of and generally positive towards these tools. We find no correlation between teaching style and attitude toward generative AI. Finally, while CS educators show far more confidence in their technical understanding of generative AI tools and more positivity towards them than educators in other fields, they show no more confidence in their ability to detect AI-generated work.
- [252] arXiv:2403.15587 [ pdf , ps , html , other ]
-
Title: Large language models for crowd decision making based on prompt design strategies using ChatGPT: models, analysis and challengesSubjects: Artificial Intelligence (cs.AI)
Abstract: Social Media and Internet have the potential to be exploited as a source of opinion to enrich Decision Making solutions. Crowd Decision Making (CDM) is a methodology able to infer opinions and decisions from plain texts, such as reviews published in social media platforms, by means of Sentiment Analysis. Currently, the emergence and potential of Large Language Models (LLMs) lead us to explore new scenarios of automatically understand written texts, also known as natural language processing. This paper analyzes the use of ChatGPT based on prompt design strategies to assist in CDM processes to extract opinions and make decisions. We integrate ChatGPT in CDM processes as a flexible tool that infer the opinions expressed in texts, providing numerical or linguistic evaluations where the decision making models are based on the prompt design strategies. We include a multi-criteria decision making scenario with a category ontology for criteria. We also consider ChatGPT as an end-to-end CDM model able to provide a general opinion and score on the alternatives. We conduct empirical experiments on real data extracted from TripAdvisor, the TripR-2020Large dataset. The analysis of results show a promising branch for developing quality decision making models using ChatGPT. Finally, we discuss the challenges of consistency, sensitivity and explainability associated to the use of LLMs in CDM processes, raising open questions for future studies.
- [253] arXiv:2403.15640 [ pdf , ps , html , other ]
-
Title: Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-MakingSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper introduces a novel multi-armed bandits framework, termed Contextual Restless Bandits (CRB), for complex online decision-making. This CRB framework incorporates the core features of contextual bandits and restless bandits, so that it can model both the internal state transitions of each arm and the influence of external global environmental contexts. Using the dual decomposition method, we develop a scalable index policy algorithm for solving the CRB problem, and theoretically analyze the asymptotical optimality of this algorithm. In the case when the arm models are unknown, we further propose a model-based online learning algorithm based on the index policy to learn the arm models and make decisions simultaneously. Furthermore, we apply the proposed CRB framework and the index policy algorithm specifically to the demand response decision-making problem in smart grids. The numerical simulations demonstrate the performance and efficiency of our proposed CRB approaches.
- [254] arXiv:2403.15696 [ pdf , ps , html , other ]
-
Title: MixRED: A Mix-lingual Relation Extraction DatasetSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Relation extraction is a critical task in the field of natural language processing with numerous real-world applications. Existing research primarily focuses on monolingual relation extraction or cross-lingual enhancement for relation extraction. Yet, there remains a significant gap in understanding relation extraction in the mix-lingual (or code-switching) scenario, where individuals intermix contents from different languages within sentences, generating mix-lingual content. Due to the lack of a dedicated dataset, the effectiveness of existing relation extraction models in such a scenario is largely unexplored. To address this issue, we introduce a novel task of considering relation extraction in the mix-lingual scenario called MixRE and constructing the human-annotated dataset MixRED to support this task. In addition to constructing the MixRED dataset, we evaluate both state-of-the-art supervised models and large language models (LLMs) on MixRED, revealing their respective advantages and limitations in the mix-lingual scenario. Furthermore, we delve into factors influencing model performance within the MixRE task and uncover promising directions for enhancing the performance of both supervised models and LLMs in this novel task.
- [255] arXiv:2403.15728 [ pdf , ps , other ]
-
Title: Learnable WSN Deployment of Evidential Collaborative Sensing ModelSubjects: Artificial Intelligence (cs.AI)
Abstract: In wireless sensor networks (WSNs), coverage and deployment are two most crucial issues when conducting detection tasks. However, the detection information collected from sensors is oftentimes not fully utilized and efficiently integrated. Such sensing model and deployment strategy, thereby, cannot reach the maximum quality of coverage, particularly when the amount of sensors within WSNs expands significantly. In this article, we aim at achieving the optimal coverage quality of WSN deployment. We develop a collaborative sensing model of sensors to enhance detection capabilities of WSNs, by leveraging the collaborative information derived from the combination rule under the framework of evidence theory. In this model, the performance evaluation of evidential fusion systems is adopted as the criterion of the sensor selection. A learnable sensor deployment network (LSDNet) considering both sensor contribution and detection capability, is proposed for achieving the optimal deployment of WSNs. Moreover, we deeply investigate the algorithm for finding the requisite minimum number of sensors that realizes the full coverage of WSNs. A series of numerical examples, along with an application of forest area monitoring, are employed to demonstrate the effectiveness and the robustness of the proposed algorithms.
- [256] arXiv:2403.15760 [ pdf , ps , html , other ]
-
Title: An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated LearningComments: Accepted by CVPR2024Subjects: Artificial Intelligence (cs.AI) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Heterogeneous Federated Learning (HtFL) enables collaborative learning on multiple clients with different model architectures while preserving privacy. Despite recent research progress, knowledge sharing in HtFL is still difficult due to data and model heterogeneity. To tackle this issue, we leverage the knowledge stored in pre-trained generators and propose a new upload-efficient knowledge transfer scheme called Federated Knowledge-Transfer Loop (FedKTL). Our FedKTL can produce client-task-related prototypical image-vector pairs via the generator's inference on the server. With these pairs, each client can transfer pre-existing knowledge from the generator to its local model through an additional supervised local task. We conduct extensive experiments on four datasets under two types of data heterogeneity with 14 kinds of models including CNNs and ViTs. Results show that our upload-efficient FedKTL surpasses seven state-of-the-art methods by up to 7.31% in accuracy. Moreover, our knowledge transfer scheme is applicable in scenarios with only one edge client. Code: this https URL
- [257] arXiv:2403.15779 [ pdf , ps , html , other ]
-
Title: The Frontier of Data Erasure: Machine Unlearning for Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) are foundational to AI advancements, facilitating applications like predictive text generation. Nonetheless, they pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information from their vast datasets. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns, offering techniques for LLMs to selectively discard certain data. This paper reviews the latest in machine unlearning for LLMs, introducing methods for the targeted forgetting of information to address privacy, ethical, and legal challenges without necessitating full model retraining. It divides existing research into unlearning from unstructured/textual data and structured/classification data, showcasing the effectiveness of these approaches in removing specific data while maintaining model efficacy. Highlighting the practicality of machine unlearning, this analysis also points out the hurdles in preserving model integrity, avoiding excessive or insufficient data removal, and ensuring consistent outputs, underlining the role of machine unlearning in advancing responsible, ethical AI.
- [258] arXiv:2403.15864 [ pdf , ps , html , other ]
-
Title: Using Large Language Models for OntoClean-based Ontology RefinementSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper explores the integration of Large Language Models (LLMs) such as GPT-3.5 and GPT-4 into the ontology refinement process, specifically focusing on the OntoClean methodology. OntoClean, critical for assessing the metaphysical quality of ontologies, involves a two-step process of assigning meta-properties to classes and verifying a set of constraints. Manually conducting the first step proves difficult in practice, due to the need for philosophical expertise and lack of consensus among ontologists. By employing LLMs with two prompting strategies, the study demonstrates that high accuracy in the labelling process can be achieved. The findings suggest the potential for LLMs to enhance ontology refinement, proposing the development of plugin software for ontology tools to facilitate this integration.
- [259] arXiv:2403.15875 [ pdf , ps , html , other ]
-
Title: LAMPER: LanguAge Model and Prompt EngineeRing for zero-shot time series classificationComments: Accepted as tiny paper in ICLR 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: This study constructs the LanguAge Model with Prompt EngineeRing (LAMPER) framework, designed to systematically evaluate the adaptability of pre-trained language models (PLMs) in accommodating diverse prompts and their integration in zero-shot time series (TS) classification. We deploy LAMPER in experimental assessments using 128 univariate TS datasets sourced from the UCR archive. Our findings indicate that the feature representation capacity of LAMPER is influenced by the maximum input token threshold imposed by PLMs.
- [260] arXiv:2403.15879 [ pdf , ps , html , other ]
-
Title: TrustSQL: A Reliability Benchmark for Text-to-SQL Models with Diverse Unanswerable QuestionsComments: under reviewSubjects: Artificial Intelligence (cs.AI)
Abstract: Recent advances in large language models (LLMs) have led to significant improvements in translating natural language questions into SQL queries. While achieving high accuracy in SQL generation is crucial, little is known about the extent to which these text-to-SQL models can reliably handle diverse types of questions encountered during real-world deployment, including unanswerable ones. To explore this aspect, we introduce TrustSQL, a new benchmark designed to assess the reliability of text-to-SQL models in both single-database and cross-database settings. TrustSQL requires models to provide one of two outputs: 1) an SQL prediction or 2) abstention from making an SQL prediction, either due to potential errors in the generated SQL or when faced with unanswerable questions. For model evaluation, we explore various modeling approaches specifically designed for this task: 1) optimizing separate models for answerability detection, SQL generation, and error detection, which are then integrated into a single pipeline; and 2) developing a unified approach that uses a single model to solve this task. Experimental results using our new reliability score show that addressing this challenge involves many different areas of research and opens new avenues for model development. However, none of the methods consistently surpasses the reliability scores of a naive baseline that abstains from SQL predictions for all questions, with varying penalties.
- [261] arXiv:2403.15901 [ pdf , ps , html , other ]
-
Title: MatchSeg: Towards Better Segmentation via Reference Image MatchingSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: Recently, automated medical image segmentation methods based on deep learning have achieved great success. However, they heavily rely on large annotated datasets, which are costly and time-consuming to acquire. Few-shot learning aims to overcome the need for annotated data by using a small labeled dataset, known as a support set, to guide predicting labels for new, unlabeled images, known as the query set. Inspired by this paradigm, we introduce MatchSeg, a novel framework that enhances medical image segmentation through strategic reference image matching. We leverage contrastive language-image pre-training (CLIP) to select highly relevant samples when defining the support set. Additionally, we design a joint attention module to strengthen the interaction between support and query features, facilitating a more effective knowledge transfer between support and query sets. We validated our method across four public datasets. Experimental results demonstrate superior segmentation performance and powerful domain generalization ability of MatchSeg against existing methods for domain-specific and cross-domain segmentation tasks. Our code is made available at this https URL
- [262] arXiv:2403.15916 [ pdf , ps , html , other ]
-
Title: Multi-agent transformer-accelerated RL for satisfaction of STL specificationsComments: Submitted to L4DC 2024 conferenceSubjects: Artificial Intelligence (cs.AI)
Abstract: One of the main challenges in multi-agent reinforcement learning is scalability as the number of agents increases. This issue is further exacerbated if the problem considered is temporally dependent. State-of-the-art solutions today mainly follow centralized training with decentralized execution paradigm in order to handle the scalability concerns. In this paper, we propose time-dependent multi-agent transformers which can solve the temporally dependent multi-agent problem efficiently with a centralized approach via the use of transformers that proficiently handle the large input. We highlight the efficacy of this method on two problems and use tools from statistics to verify the probability that the trajectories generated under the policy satisfy the task. The experiments show that our approach has superior performance against the literature baseline algorithms in both cases.
- [263] arXiv:2403.15933 [ pdf , ps , html , other ]
-
Title: Understanding Domain-Size Generalization in Markov Logic NetworksComments: Under Review. Minor clarifications added in Lemma 1Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: We study the generalization behavior of Markov Logic Networks (MLNs) across relational structures of different sizes. Multiple works have noticed that MLNs learned on a given domain generalize poorly across domains of different sizes. This behavior emerges from a lack of internal consistency within an MLN when used across different domain sizes. In this paper, we quantify this inconsistency and bound it in terms of the variance of the MLN parameters. The parameter variance also bounds the KL divergence between an MLN's marginal distributions taken from different domain sizes. We use these bounds to show that maximizing the data log-likelihood while simultaneously minimizing the parameter variance corresponds to two natural notions of generalization across domain sizes. Our theoretical results apply to Exponential Random Graphs and other Markov network based relational models. Finally, we observe that solutions known to decrease the variance of the MLN parameters, like regularization and Domain-Size Aware MLNs, increase the internal consistency of the MLNs. We empirically verify our results on four different datasets, with different methods to control parameter variance, showing that controlling parameter variance leads to better generalization.
- [264] arXiv:2403.15961 [ pdf , ps , html , other ]
-
Title: SAT Encoding of Partial Ordering Models for Graph Coloring ProblemsSubjects: Artificial Intelligence (cs.AI) ; Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Logic in Computer Science (cs.LO)
Abstract: In this paper, we suggest new SAT encodings of the partial-ordering based ILP model for the graph coloring problem (GCP) and the bandwidth coloring problem (BCP). The GCP asks for the minimum number of colors that can be assigned to the vertices of a given graph such that each two adjacent vertices get different colors. The BCP is a generalization, where each edge has a weight that enforces a minimal "distance" between the assigned colors, and the goal is to minimize the "largest" color used. For the widely studied GCP, we experimentally compare our new SAT encoding to the state-of-the-art approaches on the DIMACS benchmark set. Our evaluation confirms that this SAT encoding is effective for sparse graphs and even outperforms the state-of-the-art on some DIMACS instances. For the BCP, our theoretical analysis shows that the partial-ordering based SAT and ILP formulations have an asymptotically smaller size than that of the classical assignment-based model. Our practical evaluation confirms not only a dominance compared to the assignment-based encodings but also to the state-of-the-art approaches on a set of benchmark instances. Up to our knowledge, we have solved several open instances of the BCP from the literature for the first time.
- [265] arXiv:2403.16066 [ pdf , ps , html , other ]
-
Title: A Temporal Graph Network Framework for Dynamic RecommendationComments: Presented at the AAAI 2024 Workshop on Recommendation Ecosystems: Modeling, Optimization and Incentive DesignSubjects: Artificial Intelligence (cs.AI)
Abstract: Recommender systems, crucial for user engagement on platforms like e-commerce and streaming services, often lag behind users' evolving preferences due to static data reliance. After Temporal Graph Networks (TGNs) were proposed, various studies have shown that TGN can significantly improve situations where the features of nodes and edges dynamically change over time. However, despite its promising capabilities, it has not been directly applied in recommender systems to date. Our study bridges this gap by directly implementing Temporal Graph Networks (TGN) in recommender systems, a first in this field. Using real-world datasets and a range of graph and history embedding methods, we show TGN's adaptability, confirming its effectiveness in dynamic recommendation scenarios.
- [266] arXiv:2403.16071 [ pdf , ps , other ]
-
Title: Landmark-Guided Cross-Speaker Lip Reading with Mutual Information RegularizationLinzhi Wu , Xingyu Zhang , Yakun Zhang , Changyan Zheng , Tiejun Liu , Liang Xie , Ye Yan , Erwei YinComments: To appear in LREC-COLING 2024Journal-ref: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)Subjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract: Lip reading, the process of interpreting silent speech from visual lip movements, has gained rising attention for its wide range of realistic applications. Deep learning approaches greatly improve current lip reading systems. However, lip reading in cross-speaker scenarios where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading system may perform poorly when handling a brand new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variations across speakers, avoiding the model overfitting to specific speakers. In this work, in view of both input visual clues and latent representations based on a hybrid CTC/attention architecture, we propose to exploit the lip landmark-guided fine-grained visual clues instead of frequently-used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, a max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under the intra-speaker and inter-speaker conditions.
- [267] arXiv:2403.16097 [ pdf , ps , html , other ]
-
Title: Can Language Models Pretend Solvers? Logic Code Simulation with LLMsComments: 12 pages, 8 figuresSubjects: Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Abstract: Transformer-based large language models (LLMs) have demonstrated significant potential in addressing logic problems. capitalizing on the great capabilities of LLMs for code-related activities, several frameworks leveraging logical solvers for logic reasoning have been proposed recently. While existing research predominantly focuses on viewing LLMs as natural language logic solvers or translators, their roles as logic code interpreters and executors have received limited attention. This study delves into a novel aspect, namely logic code simulation, which forces LLMs to emulate logical solvers in predicting the results of logical programs. To further investigate this novel task, we formulate our three research questions: Can LLMs efficiently simulate the outputs of logic codes? What strength arises along with logic code simulation? And what pitfalls? To address these inquiries, we curate three novel datasets tailored for the logic code simulation task and undertake thorough experiments to establish the baseline performance of LLMs in code simulation. Subsequently, we introduce a pioneering LLM-based code simulation technique, Dual Chains of Logic (DCoL). This technique advocates a dual-path thinking approach for LLMs, which has demonstrated state-of-the-art performance compared to other LLM prompt strategies, achieving a notable improvement in accuracy by 7.06% with GPT-4-Turbo.
- [268] arXiv:2403.16100 [ pdf , ps , html , other ]
-
Title: Specifying Agent Ethics (Blue Sky Ideas)Comments: To appear in Coordination, Organizations, Institutions, Norms and Ethics for Governance of Multi-Agent Systems 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: We consider the question of what properties a Machine Ethics system should have. This question is complicated by the existence of ethical dilemmas with no agreed upon solution. We provide an example to motivate why we do not believe falling back on the elicitation of values from stakeholders is sufficient to guarantee correctness of such systems. We go on to define two broad categories of ethical property that have arisen in our own work and present a challenge to the community to approach this question in a more systematic way.
- [269] arXiv:2403.16101 [ pdf , ps , html , other ]
-
Title: Evaluating Fairness Metrics Across Borders from Human PerceptionsSubjects: Artificial Intelligence (cs.AI)
Abstract: Which fairness metrics are appropriately applicable in your contexts? There may be instances of discordance regarding the perception of fairness, even when the outcomes comply with established fairness metrics. Several surveys have been conducted to evaluate fairness metrics with human perceptions of fairness. However, these surveys were limited in scope, including only a few hundred participants within a single country. In this study, we conduct an international survey to evaluate the appropriateness of various fairness metrics in decision-making scenarios. We collected responses from 1,000 participants in each of China, France, Japan, and the United States, amassing a total of 4,000 responses, to analyze the preferences of fairness metrics. Our survey consists of three distinct scenarios paired with four fairness metrics, and each participant answers their preference for the fairness metric in each case. This investigation explores the relationship between personal attributes and the choice of fairness metrics, uncovering a significant influence of national context on these preferences.
- [270] arXiv:2403.16133 [ pdf , ps , html , other ]
-
Title: SSHPool: The Separated Subgraph-based Hierarchical PoolingSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: In this paper, we develop a novel local graph pooling method, namely the Separated Subgraph-based Hierarchical Pooling (SSHPool), for graph classification. To this end, we commence by assigning the nodes of a sample graph into different clusters, resulting in a family of separated subgraphs. We individually employ a local graph convolution units as the local structure to further compress each subgraph into a coarsened node, transforming the original graph into a coarsened graph. Since these subgraphs are separated by different clusters and the structural information cannot be propagated between them, the local convolution operation can significantly avoid the over-smoothing problem arising in most existing Graph Neural Networks (GNNs). By hierarchically performing the proposed procedures on the resulting coarsened graph, the proposed SSHPool can effectively extract the hierarchical global feature of the original graph structure, encapsulating rich intrinsic structural characteristics. Furthermore, we develop an end-to-end GNN framework associated with the proposed SSHPool module for graph classification. Experimental results demonstrate the superior performance of the proposed model on real-world datasets, significantly outperforming state-of-the-art GNN methods in terms of the classification accuracies.
- [271] arXiv:2403.16162 [ pdf , ps , html , other ]
-
Title: Multi-Task Learning with Multi-Task OptimizationSubjects: Artificial Intelligence (cs.AI)
Abstract: Multi-task learning solves multiple correlated tasks. However, conflicts may exist between them. In such circumstances, a single solution can rarely optimize all the tasks, leading to performance trade-offs. To arrive at a set of optimized yet well-distributed models that collectively embody different trade-offs in one algorithmic pass, this paper proposes to view Pareto multi-task learning through the lens of multi-task optimization. Multi-task learning is first cast as a multi-objective optimization problem, which is then decomposed into a diverse set of unconstrained scalar-valued subproblems. These subproblems are solved jointly using a novel multi-task gradient descent method, whose uniqueness lies in the iterative transfer of model parameters among the subproblems during the course of optimization. A theorem proving faster convergence through the inclusion of such transfers is presented. We investigate the proposed multi-task learning with multi-task optimization for solving various problem settings including image classification, scene understanding, and multi-target regression. Comprehensive experiments confirm that the proposed method significantly advances the state-of-the-art in discovering sets of Pareto-optimized models. Notably, on the large image dataset we tested on, namely NYUv2, the hypervolume convergence achieved by our method was found to be nearly two times faster than the next-best among the state-of-the-art.
- [272] arXiv:2403.16190 [ pdf , ps , html , other ]
-
Title: Logic-based Explanations for Linear Support Vector Classifiers with Reject OptionFrancisco Mateus Rocha Filho , Thiago Alves Rocha , Reginaldo Pereira Fernandes Ribeiro , Ajalmar Rêgo da Rocha NetoComments: 16 pages, submitted to BRACIS 2023 (Brazilian Conference on Intelligent Systems), accepted version published in Intelligent Systems, LNCS, vol 14195Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Abstract: Support Vector Classifier (SVC) is a well-known Machine Learning (ML) model for linear classification problems. It can be used in conjunction with a reject option strategy to reject instances that are hard to correctly classify and delegate them to a specialist. This further increases the confidence of the model. Given this, obtaining an explanation of the cause of rejection is important to not blindly trust the obtained results. While most of the related work has developed means to give such explanations for machine learning models, to the best of our knowledge none have done so for when reject option is present. We propose a logic-based approach with formal guarantees on the correctness and minimality of explanations for linear SVCs with reject option. We evaluate our approach by comparing it to Anchors, which is a heuristic algorithm for generating explanations. Obtained results show that our proposed method gives shorter explanations with reduced time cost.
- [273] arXiv:2403.16206 [ pdf , ps , other ]
-
Title: Rumor Detection with a novel graph neural network approachComments: 10 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: The wide spread of rumors on social media has caused a negative impact on people's daily life, leading to potential panic, fear, and mental health problems for the public. How to debunk rumors as early as possible remains a challenging problem. Existing studies mainly leverage information propagation structure to detect rumors, while very few works focus on correlation among users that they may coordinate to spread rumors in order to gain large popularity. In this paper, we propose a new detection model, that jointly learns both the representations of user correlation and information propagation to detect rumors on social media. Specifically, we leverage graph neural networks to learn the representations of user correlation from a bipartite graph that describes the correlations between users and source tweets, and the representations of information propagation with a tree structure. Then we combine the learned representations from these two modules to classify the rumors. Since malicious users intend to subvert our model after deployment, we further develop a greedy attack scheme to analyze the cost of three adversarial attacks: graph attack, comment attack, and joint attack. Evaluation results on two public datasets illustrate that the proposed MODEL outperforms the state-of-the-art rumor detection models. We also demonstrate our method performs well for early rumor detection. Moreover, the proposed detection method is more robust to adversarial attacks compared to the best existing method. Importantly, we show that it requires a high cost for attackers to subvert user correlation pattern, demonstrating the importance of considering user correlation for rumor detection.
- [274] arXiv:2403.16222 [ pdf , ps , html , other ]
-
Title: Cyber-Security Knowledge Graph Generation by Hierarchical Nonnegative Matrix FactorizationRyan Barron , Maksim E. Eren , Manish Bhattarai , Selma Wanna , Nicholas Solovyev , Kim Rasmussen , Boian S. Alexandrov , Charles Nicholas , Cynthia MatuszekComments: Accepted at IEEE ISDFSSubjects: Artificial Intelligence (cs.AI)
Abstract: Much of human knowledge in cybersecurity is encapsulated within the ever-growing volume of scientific papers. As this textual data continues to expand, the importance of document organization methods becomes increasingly crucial for extracting actionable insights hidden within large text datasets. Knowledge Graphs (KGs) serve as a means to store factual information in a structured manner, providing explicit, interpretable knowledge that includes domain-specific information from the cybersecurity scientific literature. One of the challenges in constructing a KG from scientific literature is the extraction of ontology from unstructured text. In this paper, we address this topic and introduce a method for building a multi-modal KG by extracting structured ontology from scientific papers. We demonstrate this concept in the cybersecurity domain. One modality of the KG represents observable information from the papers, such as the categories in which they were published or the authors. The second modality uncovers latent (hidden) patterns of text extracted through hierarchical and semantic non-negative matrix factorization (NMF), such as named entities, topics or clusters, and keywords. We illustrate this concept by consolidating more than two million scientific papers uploaded to arXiv into the cyber-domain, using hierarchical and semantic NMF, and by building a cyber-domain-specific KG.
- [275] arXiv:2403.16289 [ pdf , ps , html , other ]
-
Title: Engineering Safety Requirements for Autonomous Driving with Large Language ModelsComments: Accepted in 32nd IEEE International Requirements Engineering 2024 conference, IcelandSubjects: Artificial Intelligence (cs.AI)
Abstract: Changes and updates in the requirement artifacts, which can be frequent in the automotive domain, are a challenge for SafetyOps. Large Language Models (LLMs), with their impressive natural language understanding and generating capabilities, can play a key role in automatically refining and decomposing requirements after each update. In this study, we propose a prototype of a pipeline of prompts and LLMs that receives an item definition and outputs solutions in the form of safety requirements. This pipeline also performs a review of the requirement dataset and identifies redundant or contradictory requirements. We first identified the necessary characteristics for performing HARA and then defined tests to assess an LLM's capability in meeting these criteria. We used design science with multiple iterations and let experts from different companies evaluate each cycle quantitatively and qualitatively. Finally, the prototype was implemented at a case company and the responsible team evaluated its efficiency.
- [276] arXiv:2403.16393 [ pdf , ps , html , other ]
-
Title: Concurrent Linguistic Error Detection (CLED) for Large Language ModelsComments: 11 pages, 6 figures, 30 referencesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The wide adoption of Large language models (LLMs) makes their dependability a pressing concern. Detection of errors is the first step to mitigating their impact on a system and thus, efficient error detection for LLMs is an important issue. In many settings, the LLM is considered as a black box with no access to the internal nodes; this prevents the use of many error detection schemes that need access to the model's internal nodes. An interesting observation is that the output of LLMs in error-free operation should be valid and normal text. Therefore, when the text is not valid or differs significantly from normal text, it is likely that there is an error. Based on this observation we propose to perform Concurrent Linguistic Error Detection (CLED); this scheme extracts some linguistic features of the text generated by the LLM and feeds them to a concurrent classifier that detects errors. Since the proposed error detection mechanism only relies on the outputs of the model, then it can be used on LLMs in which there is no access to the internal nodes. The proposed CLED scheme has been evaluated on the T5 model when used for news summarization and on the OPUS-MT model when used for translation. In both cases, the same set of linguistic features has been used for error detection to illustrate the applicability of the proposed scheme beyond a specific case. The results show that CLED can detect most of the errors at a low overhead penalty. The use of the concurrent classifier also enables a trade-off between error detection effectiveness and its associated overhead, so providing flexibility to a designer.
- [277] arXiv:2403.16416 [ pdf , ps , html , other ]
-
Title: How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational RecommendationSubjects: Artificial Intelligence (cs.AI)
Abstract: Conversational Recommender System (CRS) interacts with users through natural language to understand their preferences and provide personalized recommendations in real-time. CRS has demonstrated significant potential, prompting researchers to address the development of more realistic and reliable user simulators as a key focus. Recently, the capabilities of Large Language Models (LLMs) have attracted a lot of attention in various fields. Simultaneously, efforts are underway to construct user simulators based on LLMs. While these works showcase innovation, they also come with certain limitations that require attention. In this work, we aim to analyze the limitations of using LLMs in constructing user simulators for CRS, to guide future research. To achieve this goal, we conduct analytical validation on the notable work, iEvaLM. Through multiple experiments on two widely-used datasets in the field of conversational recommendation, we highlight several issues with the current evaluation methods for user simulators based on LLMs: (1) Data leakage, which occurs in conversational history and the user simulator's replies, results in inflated evaluation results. (2) The success of CRS recommendations depends more on the availability and quality of conversational history than on the responses from user simulators. (3) Controlling the output of the user simulator through a single prompt template proves challenging. To overcome these limitations, we propose SimpleUserSim, employing a straightforward strategy to guide the topic toward the target items. Our study validates the ability of CRS models to utilize the interaction information, significantly improving the recommendation results.
- [278] arXiv:2403.16424 [ pdf , ps , other ]
-
Title: An Experiment with the Use of ChatGPT for LCSH Subject Assignment on Electronic Theses and DissertationsComments: 20 pagesSubjects: Artificial Intelligence (cs.AI) ; Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Abstract: This study delves into the potential use of Large Language Models (LLMs) for generating Library of Congress Subject Headings (LCSH). The authors employed ChatGPT to generate subject headings for electronic theses and dissertations (ETDs) based on their titles and summaries. The results revealed that although some generated subject headings were valid, there were issues regarding specificity and exhaustiveness. The study showcases that LLMs can serve as a strategic response to the backlog of items awaiting cataloging in academic libraries, while also offering a cost-effective approach for promptly generating LCSH. Nonetheless, human catalogers remain essential for verifying and enhancing the validity, exhaustiveness, and specificity of LCSH generated by LLMs.
- [279] arXiv:2403.16427 [ pdf , ps , html , other ]
-
Title: Re2LLM: Reflective Reinforcement Large Language Model for Session-based RecommendationComments: 11 pages, 4 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) are emerging as promising approaches to enhance session-based recommendation (SBR), where both prompt-based and fine-tuning-based methods have been widely investigated to align LLMs with SBR. However, the former methods struggle with optimal prompts to elicit the correct reasoning of LLMs due to the lack of task-specific feedback, leading to unsatisfactory recommendations. Although the latter methods attempt to fine-tune LLMs with domain-specific knowledge, they face limitations such as high computational costs and reliance on open-source backbones. To address such issues, we propose a Reflective Reinforcement Large Language Model (Re2LLM) for SBR, guiding LLMs to focus on specialized knowledge essential for more accurate recommendations effectively and efficiently. In particular, we first design the Reflective Exploration Module to effectively extract knowledge that is readily understandable and digestible by LLMs. To be specific, we direct LLMs to examine recommendation errors through self-reflection and construct a knowledge base (KB) comprising hints capable of rectifying these errors. To efficiently elicit the correct reasoning of LLMs, we further devise the Reinforcement Utilization Module to train a lightweight retrieval agent. It learns to select hints from the constructed KB based on the task-specific feedback, where the hints can serve as guidance to help correct LLMs reasoning for better recommendations. Extensive experiments on multiple real-world datasets demonstrate that our method consistently outperforms state-of-the-art methods.
- [280] arXiv:2403.16501 [ pdf , ps , html , other ]
-
Title: Learning To Guide Human Decision Makers With Vision-Language ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: There is increasing interest in developing AIs for assisting human decision-making in high-stakes tasks, such as medical diagnosis, for the purpose of improving decision quality and reducing cognitive strain. Mainstream approaches team up an expert with a machine learning model to which safer decisions are offloaded, thus letting the former focus on cases that demand their attention. his separation of responsibilities setup, however, is inadequate for high-stakes scenarios. On the one hand, the expert may end up over-relying on the machine's decisions due to anchoring bias, thus losing the human oversight that is increasingly being required by regulatory agencies to ensure trustworthy AI. On the other hand, the expert is left entirely unassisted on the (typically hardest) decisions on which the model abstained. As a remedy, we introduce learning to guide (LTG), an alternative framework in which - rather than taking control from the human expert - the machine provides guidance useful for decision making, and the human is entirely responsible for coming up with a decision. In order to ensure guidance is interpretable} and task-specific, we develop SLOG, an approach for turning any vision-language model into a capable generator of textual guidance by leveraging a modicum of human feedback. Our empirical evaluation highlights the promise of \method on a challenging, real-world medical diagnosis task.
- [281] arXiv:2403.16508 [ pdf , ps , html , other ]
-
Title: Return to Tradition: Learning Reliable Heuristics with Classical Machine LearningComments: Extended version of ICAPS 2024 paperSubjects: Artificial Intelligence (cs.AI)
Abstract: Current approaches for learning for planning have yet to achieve competitive performance against classical planners in several domains, and have poor overall performance. In this work, we construct novel graph representations of lifted planning tasks and use the WL algorithm to generate features from them. These features are used with classical machine learning methods which have up to 2 orders of magnitude fewer parameters and train up to 3 orders of magnitude faster than the state-of-the-art deep learning for planning models. Our novel approach, WL-GOOSE, reliably learns heuristics from scratch and outperforms the $h^{\text{FF}}$ heuristic in a fair competition setting. It also outperforms or ties with LAMA on 4 out of 10 domains on coverage and 7 out of 10 domains on plan quality. WL-GOOSE is the first learning for planning model which achieves these feats. Furthermore, we study the connections between our novel WL feature generation method, previous theoretically flavoured learning architectures, and Description Logic Features for planning.
- [282] arXiv:2403.16524 [ pdf , ps , html , other ]
-
Title: Harnessing the power of LLMs for normative reasoning in MASsComments: 12 pages, 1 figure, accepted to COINE 2024 workshop at AAMAS 2024 ( this https URL )Subjects: Artificial Intelligence (cs.AI)
Abstract: Software agents, both human and computational, do not exist in isolation and often need to collaborate or coordinate with others to achieve their goals. In human society, social mechanisms such as norms ensure efficient functioning, and these techniques have been adopted by researchers in multi-agent systems (MAS) to create socially aware agents. However, traditional techniques have limitations, such as operating in limited environments often using brittle symbolic reasoning. The advent of Large Language Models (LLMs) offers a promising solution, providing a rich and expressive vocabulary for norms and enabling norm-capable agents that can perform a range of tasks such as norm discovery, normative reasoning and decision-making. This paper examines the potential of LLM-based agents to acquire normative capabilities, drawing on recent Natural Language Processing (NLP) and LLM research. We present our vision for creating normative LLM agents. In particular, we discuss how the recently proposed "LLM agent" approaches can be extended to implement such normative LLM agents. We also highlight challenges in this emerging field. This paper thus aims to foster collaboration between MAS, NLP and LLM researchers in order to advance the field of normative agents.
- [283] arXiv:2403.16527 [ pdf , ps , html , other ]
-
Title: Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the ArtComments: 31 pages, 2 tablesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Robotics (cs.RO)
Abstract: Autonomous systems are soon to be ubiquitous, from manufacturing autonomy to agricultural field robots, and from health care assistants to the entertainment industry. The majority of these systems are developed with modular sub-components for decision-making, planning, and control that may be hand-engineered or learning-based. While these existing approaches have been shown to perform well under the situations they were specifically designed for, they can perform especially poorly in rare, out-of-distribution scenarios that will undoubtedly arise at test-time. The rise of foundation models trained on multiple tasks with impressively large datasets from a variety of fields has led researchers to believe that these models may provide common sense reasoning that existing planners are missing. Researchers posit that this common sense reasoning will bridge the gap between algorithm development and deployment to out-of-distribution tasks, like how humans adapt to unexpected scenarios. Large language models have already penetrated the robotics and autonomous systems domains as researchers are scrambling to showcase their potential use cases in deployment. While this application direction is very promising empirically, foundation models are known to hallucinate and generate decisions that may sound reasonable, but are in fact poor. We argue there is a need to step back and simultaneously design systems that can quantify the certainty of a model's decision, and detect when it may be hallucinating. In this work, we discuss the current use cases of foundation models for decision-making tasks, provide a general definition for hallucinations with examples, discuss existing approaches to hallucination detection and mitigation with a focus on decision problems, and explore areas for further research in this exciting field.
- [284] arXiv:2403.16649 [ pdf , ps , html , other ]
-
Title: CLHA: A Simple yet Effective Contrastive Learning Framework for Human AlignmentFeiteng Fang , Liang Zhu , Min Yang , Xi Feng , Jinchang Hou , Qixuan Zhao , Chengming Li , Xiping Hu , Ruifeng XuSubjects: Artificial Intelligence (cs.AI)
Abstract: Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences, ensuring these LLMs behave in beneficial and comprehensible ways to users. However, a longstanding challenge in human alignment techniques based on reinforcement learning lies in their inherent complexity and difficulty in training. To address this challenge, we present a simple yet effective Contrastive Learning Framework for Human Alignment (CLHA) to align LLMs with human preferences directly. CLHA employs a novel rescoring strategy to evaluate the noise within the data by considering its inherent quality and dynamically adjusting the training process. Simultaneously, CLHA utilizes pairwise contrastive loss and adaptive supervised fine-tuning loss to adaptively modify the likelihood of generating responses, ensuring enhanced alignment with human preferences. Using advanced methods, CLHA surpasses other algorithms, showcasing superior performance in terms of reward model scores, automatic evaluations, and human assessments on the widely used ``Helpful and Harmless'' dataset.
- [285] arXiv:2403.16667 [ pdf , ps , html , other ]
-
Title: Deep Reinforcement Learning and Mean-Variance Strategies for Responsible Portfolio OptimizationFernando Acero , Parisa Zehtabi , Nicolas Marchesotti , Michael Cashmore , Daniele Magazzeni , Manuela VelosoComments: Presented at the AAAI 2024 Workshop on AI in Finance for Social ImpactSubjects: Artificial Intelligence (cs.AI)
Abstract: Portfolio optimization involves determining the optimal allocation of portfolio assets in order to maximize a given investment objective. Traditionally, some form of mean-variance optimization is used with the aim of maximizing returns while minimizing risk, however, more recently, deep reinforcement learning formulations have been explored. Increasingly, investors have demonstrated an interest in incorporating ESG objectives when making investment decisions, and modifications to the classical mean-variance optimization framework have been developed. In this work, we study the use of deep reinforcement learning for responsible portfolio optimization, by incorporating ESG states and objectives, and provide comparisons against modified mean-variance approaches. Our results show that deep reinforcement learning policies can provide competitive performance against mean-variance approaches for responsible portfolio allocation across additive and multiplicative utility functions of financial and ESG responsibility objectives.
- [286] arXiv:2403.16728 [ pdf , ps , html , other ]
-
Title: Improving Diffusion Models's Data-Corruption Resistance using Scheduled Pseudo-Huber LossComments: 13 pages, 16 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: Diffusion models are known to be vulnerable to outliers in training data. In this paper we study an alternative diffusion loss function, which can preserve the high quality of generated data like the original squared $L_{2}$ loss while at the same time being robust to outliers. We propose to use pseudo-Huber loss function with a time-dependent parameter to allow for the trade-off between robustness on the most vulnerable early reverse-diffusion steps and fine details restoration on the final steps. We show that pseudo-Huber loss with the time-dependent parameter exhibits better performance on corrupted datasets in both image and audio domains. In addition, the loss function we propose can potentially help diffusion models to resist dataset corruption while not requiring data filtering or purification compared to conventional training algorithms.
- [287] arXiv:2403.16732 [ pdf , ps , html , other ]
-
Title: Enabling Uncertainty Estimation in Iterative Neural NetworksSubjects: Artificial Intelligence (cs.AI)
Abstract: Turning pass-through network architectures into iterative ones, which use their own output as input, is a well-known approach for boosting performance. In this paper, we argue that such architectures offer an additional benefit: The convergence rate of their successive outputs is highly correlated with the accuracy of the value to which they converge. Thus, we can use the convergence rate as a useful proxy for uncertainty. This results in an approach to uncertainty estimation that provides state-of-the-art estimates at a much lower computational cost than techniques like Ensembles, and without requiring any modifications to the original iterative model. We demonstrate its practical value by embedding it in two application domains: road detection in aerial images and the estimation of aerodynamic properties of 2D and 3D shapes.
- [288] arXiv:2403.16750 [ pdf , ps , html , other ]
-
Title: All Artificial, Less Intelligence: GenAI through the Lens of Formal VerificationComments: Published in DVCon U.S. 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Modern hardware designs have grown increasingly efficient and complex. However, they are often susceptible to Common Weakness Enumerations (CWEs). This paper is focused on the formal verification of CWEs in a dataset of hardware designs written in SystemVerilog from Regenerative Artificial Intelligence (AI) powered by Large Language Models (LLMs). We applied formal verification to categorize each hardware design as vulnerable or CWE-free. This dataset was generated by 4 different LLMs and features a unique set of designs for each of the 10 CWEs we target in our paper. We have associated the identified vulnerabilities with CWE numbers for a dataset of 60,000 generated SystemVerilog Register Transfer Level (RTL) code. It was also found that most LLMs are not aware of any hardware CWEs; hence they are usually not considered when generating the hardware code. Our study reveals that approximately 60% of the hardware designs generated by LLMs are prone to CWEs, posing potential safety and security risks. The dataset could be ideal for training LLMs and Machine Learning (ML) algorithms to abstain from generating CWE-prone hardware designs.
- [289] arXiv:2403.16808 [ pdf , ps , html , other ]
-
Title: Navigating the EU AI Act: A Methodological Approach to Compliance for Safety-critical ProductsComments: To be published in: 2024 IEEE Conference on Artificial Intelligence (CAI 2024)Subjects: Artificial Intelligence (cs.AI)
Abstract: In December 2023, the European Parliament provisionally agreed on the EU AI Act. This unprecedented regulatory framework for AI systems lays out guidelines to ensure the safety, legality, and trustworthiness of AI products. This paper presents a methodology for interpreting the EU AI Act requirements for high-risk AI systems by leveraging product quality models. We first propose an extended product quality model for AI systems, incorporating attributes relevant to the Act not covered by current quality models. We map the Act requirements to relevant quality attributes with the goal of refining them into measurable characteristics. We then propose a contract-based approach to derive technical requirements at the stakeholder level. This facilitates the development and assessment of AI systems that not only adhere to established quality standards, but also comply with the regulatory requirements outlined in the Act for high-risk (including safety-critical) AI systems. We demonstrate the applicability of this methodology on an exemplary automotive supply chain use case, where several stakeholders interact to achieve EU AI Act compliance.
- [290] arXiv:2403.16824 [ pdf , ps , html , other ]
-
Title: On Policy Reuse: An Expressive Language for Representing and Executing General Policies that Call Other PoliciesComments: ICAPS 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: Recently, a simple but powerful language for expressing and learning general policies and problem decompositions (sketches) has been introduced in terms of rules defined over a set of Boolean and numerical features. In this work, we consider three extensions of this language aimed at making policies and sketches more flexible and reusable: internal memory states, as in finite state controllers; indexical features, whose values are a function of the state and a number of internal registers that can be loaded with objects; and modules that wrap up policies and sketches and allow them to call each other by passing parameters. In addition, unlike general policies that select state transitions rather than ground actions, the new language allows for the selection of such actions. The expressive power of the resulting language for policies and sketches is illustrated through a number of examples.
- [291] arXiv:2403.16858 [ pdf , ps , html , other ]
-
Title: XAIport: A Service Framework for the Early Adoption of XAI in AI Model DevelopmentComments: Accepted at the ICSE'24 conference, NIER trackSubjects: Artificial Intelligence (cs.AI)
Abstract: In this study, we propose the early adoption of Explainable AI (XAI) with a focus on three properties: Quality of explanation, the explanation summaries should be consistent across multiple XAI methods; Architectural Compatibility, for effective integration in XAI, the architecture styles of both the XAI methods and the models to be explained must be compatible with the framework; Configurable operations, XAI explanations are operable, akin to machine learning operations. Thus, an explanation for AI models should be reproducible and tractable to be trustworthy. We present XAIport, a framework of XAI microservices encapsulated into Open APIs to deliver early explanations as observation for learning model quality assurance. XAIport enables configurable XAI operations along with machine learning development. We quantify the operational costs of incorporating XAI with three cloud computer vision services on Microsoft Azure Cognitive Services, Google Cloud Vertex AI, and Amazon Rekognition. Our findings show comparable operational costs between XAI and traditional machine learning, with XAIport significantly improving both cloud AI model performance and explanation stability.
- [292] arXiv:2403.16904 [ pdf , ps , html , other ]
-
Title: Multi-Agent Optimization for Safety Analysis of Cyber-Physical Systems: Position PaperComments: 13 pages, 2 figures, 1 table, "2nd International Workshop on Emerging Ideas and Trends in Engineering of Cyber-Physical Systems, part of Cyber-Physical Systems Week, April 2015, Seattle, USA"Subjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR)
Abstract: Failure Mode, Effects and Criticality Analysis (FMECA) is one of the safety analysis methods recommended by most of the international standards. The classical FMECA is made in a form of a table filled in either manually or by using safety analysis tools. In both cases, the design engineers have to choose the trade-offs between safety and other development constraints. In the case of complex cyber-physical systems (CPS) with thousands of specified constraints, this may lead to severe problems and significantly impact the overall criticality of CPS. In this paper, we propose to adopt optimization techniques to automate the decision making process conducted after FMECA of CPS. We describe a multi-agent based optimization method which extends classical FMECA for offering optimal solutions in terms of criticality and development constraints of CPS.
- [293] arXiv:2403.16908 [ pdf , ps , html , other ]
-
Title: Towards Trustworthy Automated Driving through Qualitative Scene Understanding and ExplanationsComments: SAE International Journal of Connected and Automated VehiclesSubjects: Artificial Intelligence (cs.AI)
Abstract: Understanding driving scenes and communicating automated vehicle decisions are key requirements for trustworthy automated driving. In this article, we introduce the Qualitative Explainable Graph (QXG), which is a unified symbolic and qualitative representation for scene understanding in urban mobility. The QXG enables interpreting an automated vehicle's environment using sensor data and machine learning models. It utilizes spatio-temporal graphs and qualitative constraints to extract scene semantics from raw sensor inputs, such as LiDAR and camera data, offering an interpretable scene model. A QXG can be incrementally constructed in real-time, making it a versatile tool for in-vehicle explanations across various sensor types. Our research showcases the potential of QXG, particularly in the context of automated driving, where it can rationalize decisions by linking the graph with observed actions. These explanations can serve diverse purposes, from informing passengers and alerting vulnerable road users to enabling post-hoc analysis of prior behaviors.
- [294] arXiv:2403.16909 [ pdf , ps , html , other ]
-
Title: Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated DataComments: 14 pages, 16 figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HEADROOM, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyzes to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic data generation for depression data. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.
- [295] arXiv:2403.16984 [ pdf , ps , html , other ]
-
Title: Modelling Commonsense Commonalities with Multi-Facet Concept EmbeddingsSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Concept embeddings offer a practical and efficient mechanism for injecting commonsense knowledge into downstream tasks. Their core purpose is often not to predict the commonsense properties of concepts themselves, but rather to identify commonalities, i.e.\ sets of concepts which share some property of interest. Such commonalities are the basis for inductive generalisation, hence high-quality concept embeddings can make learning easier and more robust. Unfortunately, standard embeddings primarily reflect basic taxonomic categories, making them unsuitable for finding commonalities that refer to more specific aspects (e.g.\ the colour of objects or the materials they are made of). In this paper, we address this limitation by explicitly modelling the different facets of interest when learning concept embeddings. We show that this leads to embeddings which capture a more diverse range of commonsense properties, and consistently improves results in downstream tasks such as ultra-fine entity typing and ontology completion.
- [296] arXiv:2403.17040 [ pdf , ps , other ]
-
Title: Enhancing Graph Representation Learning with Attention-Driven Spiking Neural NetworksSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: Graph representation learning has become a crucial task in machine learning and data mining due to its potential for modeling complex structures such as social networks, chemical compounds, and biological systems. Spiking neural networks (SNNs) have recently emerged as a promising alternative to traditional neural networks for graph learning tasks, benefiting from their ability to efficiently encode and process temporal and spatial information. In this paper, we propose a novel approach that integrates attention mechanisms with SNNs to improve graph representation learning. Specifically, we introduce an attention mechanism for SNN that can selectively focus on important nodes and corresponding features in a graph during the learning process. We evaluate our proposed method on several benchmark datasets and show that it achieves comparable performance compared to existing graph learning techniques.
- [297] arXiv:2403.17101 [ pdf , ps , other ]
-
Title: AI Consciousness is Inevitable: A Theoretical Computer Science PerspectiveSubjects: Artificial Intelligence (cs.AI)
Abstract: We look at consciousness through the lens of Theoretical Computer Science, a branch of mathematics that studies computation under resource limitations. From this perspective, we develop a formal machine model for consciousness. The model is inspired by Alan Turing's simple yet powerful model of computation and Bernard Baars' theater model of consciousness. Though extremely simple, the model aligns at a high level with many of the major scientific theories of human and animal consciousness, supporting our claim that machine consciousness is inevitable.
- [298] arXiv:2403.17108 [ pdf , ps , html , other ]
-
Title: Graph Protection under Multiple Simultaneous Attacks: A Heuristic ApproachComments: 32 pages, 10 figuresSubjects: Artificial Intelligence (cs.AI)
Abstract: This work focuses on developing an effective meta-heuristic approach to protect against simultaneous attacks on nodes of a network modeled using a graph. Specifically, we focus on the $k$-strong Roman domination problem, a generalization of the well-known Roman domination problem on graphs. This general problem is about assigning integer weights to nodes that represent the number of field armies stationed at each node in order to satisfy the protection constraints while minimizing the total weights. These constraints concern the protection of a graph against any simultaneous attack consisting of $k \in \mathbb{N}$ nodes. An attack is considered repelled if each node labeled 0 can be defended by borrowing an army from one of its neighboring nodes, ensuring that the neighbor retains at least one army for self-defense. The $k$-SRD problem has practical applications in various areas, such as developing counter-terrorism strategies or managing supply chain disruptions. The solution to this problem is notoriously difficult to find, as even checking the feasibility of the proposed solution requires an exponential number of steps. We propose a variable neighborhood search algorithm in which the feasibility of the solution is checked by introducing the concept of quasi-feasibility, which is realized by careful sampling within the set of all possible attacks. Extensive experimental evaluations show the scalability and robustness of the proposed approach compared to the two exact approaches from the literature. Experiments are conducted with random networks from the literature and newly introduced random wireless networks as well as with real-world networks. A practical application scenario, using real-world networks, involves applying our approach to graphs extracted from GeoJSON files containing geographic features of hundreds of cities or larger regions.
- [299] arXiv:2403.17209 [ pdf , ps , other ]
-
Title: Generation of Asset Administration Shell with Large Language Model Agents: Interoperability in Digital Twins with Semantic NodeComments: Pre-print, submitted to IEEE ACCESS, under peer-reviewSubjects: Artificial Intelligence (cs.AI) ; Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Abstract: This research introduces a novel approach for assisting the creation of Asset Administration Shell (AAS) instances for digital twin modeling within the context of Industry 4.0, aiming to enhance interoperability in smart manufacturing and reduce manual effort. We construct a "semantic node" data structure to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process "semantic node" and generate AAS instance models from textual technical data. Our evaluation demonstrates a 62-79% effective generation rate, indicating a substantial proportion of manual creation effort can be converted into easier validation effort, thereby reducing the time and cost in creating AAS instance models. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts. Our findings emphasize LLMs' capability in automating AAS instance creation, enhancing semantic interoperability, and contributing to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are released on our GitHub Repository with the link: this https URL
- [300] arXiv:2403.17234 [ pdf , ps , html , other ]
-
Title: Speeding Up Path Planning via Reinforcement Learning in MCTS for Automated ParkingSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO)
Abstract: In this paper, we address a method that integrates reinforcement learning into the Monte Carlo tree search to boost online path planning under fully observable environments for automated parking tasks. Sampling-based planning methods under high-dimensional space can be computationally expensive and time-consuming. State evaluation methods are useful by leveraging the prior knowledge into the search steps, making the process faster in a real-time system. Given the fact that automated parking tasks are often executed under complex environments, a solid but lightweight heuristic guidance is challenging to compose in a traditional analytical way. To overcome this limitation, we propose a reinforcement learning pipeline with a Monte Carlo tree search under the path planning framework. By iteratively learning the value of a state and the best action among samples from its previous cycle's outcomes, we are able to model a value estimator and a policy generator for given states. By doing that, we build up a balancing mechanism between exploration and exploitation, speeding up the path planning process while maintaining its quality without using human expert driver data.
- [301] arXiv:2403.17246 [ pdf , ps , html , other ]
-
Title: TwoStep: Multi-agent Task Planning using Classical Planners and Large Language ModelsComments: 12 pagesSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract: Classical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, for example that two agents in the domain can execute an action simultaneously if postconditions of each do not interfere with preconditions of the other. A human expert can decompose a goal into largely independent constituent parts and assign each agent to one of these subgoals to take advantage of simultaneous actions for faster execution of plan steps, each using only single agent planning. By contrast, large language models (LLMs) used for directly inferring plan steps do not guarantee execution success, but do leverage commonsense reasoning to assemble action sequences. We combine the strengths of classical planning and LLMs by approximating human intuitions for two-agent planning goal decomposition. We demonstrate that LLM-based goal decomposition leads to faster planning times than solving multi-agent PDDL problems directly while simultaneously achieving fewer plan execution steps than a single agent plan alone and preserving execution success. Additionally, we find that LLM-based approximations of subgoals can achieve similar multi-agent execution steps than those specified by human experts. Website and resources at this https URL
- [302] arXiv:2403.17247 [ pdf , ps , html , other ]
-
Title: DASA: Delay-Adaptive Multi-Agent Stochastic ApproximationNicolo Dal Fabbro , Arman Adibi , H. Vincent Poor , Sanjeev R. Kulkarni , Aritra Mitra , George J. PappasSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: We consider a setting in which $N$ agents aim to speedup a common Stochastic Approximation (SA) problem by acting in parallel and communicating with a central server. We assume that the up-link transmissions to the server are subject to asynchronous and potentially unbounded time-varying delays. To mitigate the effect of delays and stragglers while reaping the benefits of distributed computation, we propose \texttt{DASA}, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation. We provide a finite-time analysis of \texttt{DASA} assuming that the agents' stochastic observation processes are independent Markov chains. Significantly advancing existing results, \texttt{DASA} is the first algorithm whose convergence rate depends only on the mixing time $\tau_{mix}$ and on the average delay $\tau_{avg}$ while jointly achieving an $N$-fold convergence speedup under Markovian sampling. Our work is relevant for various SA applications, including multi-agent and distributed temporal difference (TD) learning, Q-learning and stochastic optimization with correlated data.
- [303] arXiv:2403.17306 [ pdf , ps , html , other ]
-
Title: Visual Hallucination: Definition, Quantification, and Prescriptive RemediationsAnku Rani , Vipula Rawte , Harshad Sharma , Neeraj Anand , Krishnav Rajbangshi , Amit Sheth , Amitava DasSubjects: Artificial Intelligence (cs.AI)
Abstract: The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI. In recent times, considerable research has focused on detecting and mitigating hallucination in Large Language Models (LLMs). However, it's worth noting that hallucination is also quite prevalent in Vision-Language models (VLMs). In this paper, we offer a fine-grained discourse on profiling VLM hallucination based on two tasks: i) image captioning, and ii) Visual Question Answering (VQA). We delineate eight fine-grained orientations of visual hallucination: i) Contextual Guessing, ii) Identity Incongruity, iii) Geographical Erratum, iv) Visual Illusion, v) Gender Anomaly, vi) VLM as Classifier, vii) Wrong Reading, and viii) Numeric Discrepancy. We curate Visual HallucInation eLiciTation (VHILT), a publicly available dataset comprising 2,000 samples generated using eight VLMs across two tasks of captioning and VQA along with human annotations for the categories as mentioned earlier.
- [304] arXiv:2403.17312 [ pdf , ps , html , other ]
-
Title: ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV CachingComments: ISCA 2024Subjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Performance (cs.PF)
Abstract: The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference, concerning the compute and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet, this approach requires increasing memory as demand grows for processing longer sequences. The overhead leads to reduced throughput due to I/O bottlenecks and even out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this paper, we propose ALISA, a novel algorithm-system co-design solution to address the challenges imposed by KV caching. On the algorithm level, ALISA prioritizes tokens that are most important in generating a new token via a Sparse Window Attention (SWA) algorithm. SWA introduces high sparsity in attention layers and reduces the memory footprint of KV caching at negligible accuracy loss. On the system level, ALISA employs three-phase token-level dynamical scheduling and optimizes the trade-off between caching and recomputation, thus maximizing the overall performance in resource-constrained systems. In a single GPU-CPU system, we demonstrate that under varying workloads, ALISA improves the throughput of baseline systems such as FlexGen and vLLM by up to 3X and 1.9X, respectively.
- [305] arXiv:2403.17328 [ pdf , ps , html , other ]
-
Title: Learning Traffic Signal Control via Genetic ProgrammingSubjects: Artificial Intelligence (cs.AI) ; Neural and Evolutionary Computing (cs.NE)
Abstract: The control of traffic signals is crucial for improving transportation efficiency. Recently, learning-based methods, especially Deep Reinforcement Learning (DRL), garnered substantial success in the quest for more efficient traffic signal control strategies. However, the design of rewards in DRL highly demands domain knowledge to converge to an effective policy, and the final policy also presents difficulties in terms of explainability. In this work, a new learning-based method for signal control in complex intersections is proposed. In our approach, we design a concept of phase urgency for each signal phase. During signal transitions, the traffic light control strategy selects the next phase to be activated based on the phase urgency. We then proposed to represent the urgency function as an explainable tree structure. The urgency function can calculate the phase urgency for a specific phase based on the current road conditions. Genetic programming is adopted to perform gradient-free optimization of the urgency function. We test our algorithm on multiple public traffic signal control datasets. The experimental results indicate that the tree-shaped urgency function evolved by genetic programming outperforms the baselines, including a state-of-the-art method in the transportation field and a well-known DRL-based method.
- [306] arXiv:2403.17333 [ pdf , ps , html , other ]
-
Title: The Pursuit of Fairness in Artificial Intelligence Models: A SurveyComments: 37 pages, 6 figuresSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Artificial Intelligence (AI) models are now being utilized in all facets of our lives such as healthcare, education and employment. Since they are used in numerous sensitive environments and make decisions that can be life altering, potential biased outcomes are a pressing matter. Developers should ensure that such models don't manifest any unexpected discriminatory practices like partiality for certain genders, ethnicities or disabled people. With the ubiquitous dissemination of AI systems, researchers and practitioners are becoming more aware of unfair models and are bound to mitigate bias in them. Significant research has been conducted in addressing such issues to ensure models don't intentionally or unintentionally perpetuate bias. This survey offers a synopsis of the different ways researchers have promoted fairness in AI systems. We explore the different definitions of fairness existing in the current literature. We create a comprehensive taxonomy by categorizing different types of bias and investigate cases of biased AI in different application domains. A thorough study is conducted of the approaches and techniques employed by researchers to mitigate bias in AI models. Moreover, we also delve into the impact of biased models on user experience and the ethical considerations to contemplate when developing and deploying such models. We hope this survey helps researchers and practitioners understand the intricate details of fairness and bias in AI systems. By sharing this thorough survey, we aim to promote additional discourse in the domain of equitable and responsible AI.
- [307] arXiv:2403.17350 [ pdf , ps , html , other ]
-
Title: The Solution of the Zodiac Killer's 340-Character CipherSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR)
Abstract: The case of the Zodiac Killer is one of the most widely known unsolved serial killer cases in history. The unidentified killer murdered five known victims and terrorized the state of California. He also communicated extensively with the press and law enforcement. Besides his murders, Zodiac was known for his use of ciphers. The first Zodiac cipher was solved within a week of its publication, while the second cipher was solved by the authors after 51 years, when it was discovered to be a transposition and homophonic substitution cipher with unusual qualities. In this paper, we detail the historical significance of this cipher and the numerous efforts which culminated in its solution.
- [308] arXiv:2403.17358 [ pdf , ps , html , other ]
-
Title: Addressing Myopic Constrained POMDP Planning with Recursive Dual AscentComments: Accepted to the 2024 International Conference on Automated Planning and Scheduling (ICAPS)Subjects: Artificial Intelligence (cs.AI)
Abstract: Lagrangian-guided Monte Carlo tree search with global dual ascent has been applied to solve large constrained partially observable Markov decision processes (CPOMDPs) online. In this work, we demonstrate that these global dual parameters can lead to myopic action selection during exploration, ultimately leading to suboptimal decision making. To address this, we introduce history-dependent dual variables that guide local action selection and are optimized with recursive dual ascent. We empirically compare the performance of our approach on a motivating toy example and two large CPOMDPs, demonstrating improved exploration, and ultimately, safer outcomes.
- [309] arXiv:2403.17384 [ pdf , ps , html , other ]
-
Title: Explainable Graph Neural Networks for Observation Impact Analysis in Atmospheric State EstimationSubjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: This paper investigates the impact of observations on atmospheric state estimation in weather forecasting systems using graph neural networks (GNNs) and explainability methods. We integrate observation and Numerical Weather Prediction (NWP) points into a meteorological graph, extracting $k$-hop subgraphs centered on NWP points. Self-supervised GNNs are employed to estimate the atmospheric state by aggregating data within these $k$-hop radii. The study applies gradient-based explainability methods to quantify the significance of different observations in the estimation process. Evaluated with data from 11 satellite and land-based observations, the results highlight the effectiveness of visualizing the importance of observation types, enhancing the understanding and optimization of observational data in weather forecasting.
- [310] arXiv:2403.17395 [ pdf , ps , html , other ]
-
Title: An Open-source End-to-End Logic Optimization Framework for Large-scale Boolean Network with Reinforcement LearningComments: 5 pages, 4 figures, 1 tableSubjects: Artificial Intelligence (cs.AI)
Abstract: We propose an open-source end-to-end logic optimization framework for large-scale boolean network with reinforcement learning.
- [311] arXiv:2403.17419 [ pdf , ps , html , other ]
-
Title: AI Safety: Necessary, but insufficient and possibly problematicComments: AI & Soc (2024)Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY)
Abstract: This article critically examines the recent hype around AI safety. We first start with noting the nature of the AI safety hype as being dominated by governments and corporations, and contrast it with other avenues within AI research on advancing social good. We consider what 'AI safety' actually means, and outline the dominant concepts that the digital footprint of AI safety aligns with. We posit that AI safety has a nuanced and uneasy relationship with transparency and other allied notions associated with societal good, indicating that it is an insufficient notion if the goal is that of societal good in a broad sense. We note that the AI safety debate has already influenced some regulatory efforts in AI, perhaps in not so desirable directions. We also share our concerns on how AI safety may normalize AI that advances structural harm through providing exploitative and harmful AI with a veneer of safety.
- [312] arXiv:2403.17426 [ pdf , ps , html , other ]
-
Title: Knowledge-Powered Recommendation for an Improved Diet Water FootprintComments: 3 pages, 1 figure, AAAI'24Subjects: Artificial Intelligence (cs.AI)
Abstract: According to WWF, 1.1 billion people lack access to water, and 2.7 billion experience water scarcity at least one month a year. By 2025, two-thirds of the world's population may be facing water shortages. This highlights the urgency of managing water usage efficiently, especially in water-intensive sectors like food. This paper proposes a recommendation engine, powered by knowledge graphs, aiming to facilitate sustainable and healthy food consumption. The engine recommends ingredient substitutes in user recipes that improve nutritional value and reduce environmental impact, particularly water footprint. The system architecture includes source identification, information extraction, schema alignment, knowledge graph construction, and user interface development. The research offers a promising tool for promoting healthier eating habits and contributing to water conservation efforts.
- [313] arXiv:2403.17428 [ pdf , ps , html , other ]
-
Title: Aligning Large Language Models for Enhancing Psychiatric Interviews through Symptom Delineation and SummarizationJae-hee So , Joonhwan Chang , Eunji Kim , Junho Na , JiYeon Choi , Jy-yong Sohn , Byung-Hoon Kim , Sang Hui ChuSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)
Abstract: Recent advancements in Large Language Models (LLMs) have accelerated their usage in various domains. Given the fact that psychiatric interviews are goal-oriented and structured dialogues between the professional interviewer and the interviewee, it is one of the most underexplored areas where LLMs can contribute substantial value. Here, we explore the use of LLMs for enhancing psychiatric interviews, by analyzing counseling data from North Korean defectors with traumatic events and mental health issues. Specifically, we investigate whether LLMs can (1) delineate the part of the conversation that suggests psychiatric symptoms and name the symptoms, and (2) summarize stressors and symptoms, based on the interview dialogue transcript. Here, the transcript data was labeled by mental health experts for training and evaluation of LLMs. Our experimental results show that appropriately prompted LLMs can achieve high performance on both the symptom delineation task and the summarization task. This research contributes to the nascent field of applying LLMs to psychiatric interview and demonstrates their potential effectiveness in aiding mental health practitioners.
- [314] arXiv:2403.17532 [ pdf , ps , html , other ]
-
Title: KC-GenRe: A Knowledge-constrained Generative Re-ranking Method Based on Large Language Models for Knowledge Graph CompletionComments: This paper has been accepted for publication in the proceedings of LREC-COLING 2024Subjects: Artificial Intelligence (cs.AI)
Abstract: The goal of knowledge graph completion (KGC) is to predict missing facts among entities. Previous methods for KGC re-ranking are mostly built on non-generative language models to obtain the probability of each candidate. Recently, generative large language models (LLMs) have shown outstanding performance on several tasks such as information extraction and dialog systems. Leveraging them for KGC re-ranking is beneficial for leveraging the extensive pre-trained knowledge and powerful generative capabilities. However, it may encounter new problems when accomplishing the task, namely mismatch, misordering and omission. To this end, we introduce KC-GenRe, a knowledge-constrained generative re-ranking method based on LLMs for KGC. To overcome the mismatch issue, we formulate the KGC re-ranking task as a candidate identifier sorting generation problem implemented by generative LLMs. To tackle the misordering issue, we develop a knowledge-guided interactive training method that enhances the identification and ranking of candidates. To address the omission issue, we design a knowledge-augmented constrained inference method that enables contextual prompting and controlled generation, so as to obtain valid rankings. Experimental results show that KG-GenRe achieves state-of-the-art performance on four datasets, with gains of up to 6.7% and 7.7% in the MRR and Hits@1 metric compared to previous methods, and 9.0% and 11.1% compared to that without re-ranking. Extensive analysis demonstrates the effectiveness of components in KG-GenRe.
- [315] arXiv:2403.17549 [ pdf , ps , other ]
-
Title: Practical Applications of Advanced Cloud Services and Generative AI Systems in Medical Image AnalysisSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: The medical field is one of the important fields in the application of artificial intelligence technology. With the explosive growth and diversification of medical data, as well as the continuous improvement of medical needs and challenges, artificial intelligence technology is playing an increasingly important role in the medical field. Artificial intelligence technologies represented by computer vision, natural language processing, and machine learning have been widely penetrated into diverse scenarios such as medical imaging, health management, medical information, and drug research and development, and have become an important driving force for improving the level and quality of medical services.The article explores the transformative potential of generative AI in medical imaging, emphasizing its ability to generate syntheticACM-2 data, enhance images, aid in anomaly detection, and facilitate image-to-image translation. Despite challenges like model complexity, the applications of generative models in healthcare, including Med-PaLM 2 technology, show promising results. By addressing limitations in dataset size and diversity, these models contribute to more accurate diagnoses and improved patient outcomes. However, ethical considerations and collaboration among stakeholders are essential for responsible implementation. Through experiments leveraging GANs to augment brain tumor MRI datasets, the study demonstrates how generative AI can enhance image quality and diversity, ultimately advancing medical diagnostics and patient care.
- [316] arXiv:2403.17601 [ pdf , ps , html , other ]
-
Title: LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic SimulationComments: My company has this rule: if the data for external use is not published on official website, it needs to be disclosed through external data disclosure. I have not done external data disclosure before submitting itSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Microscopic traffic simulation plays a crucial role in transportation engineering by providing insights into individual vehicle behavior and overall traffic flow. However, creating a realistic simulator that accurately replicates human driving behaviors in various traffic conditions presents significant challenges. Traditional simulators relying on heuristic models often fail to deliver accurate simulations due to the complexity of real-world traffic environments. Due to the covariate shift issue, existing imitation learning-based simulators often fail to generate stable long-term simulations. In this paper, we propose a novel approach called learner-aware supervised imitation learning to address the covariate shift problem in multi-agent imitation learning. By leveraging a variational autoencoder simultaneously modeling the expert and learner state distribution, our approach augments expert states such that the augmented state is aware of learner state distribution. Our method, applied to urban traffic simulation, demonstrates significant improvements over existing state-of-the-art baselines in both short-term microscopic and long-term macroscopic realism when evaluated on the real-world dataset pNEUMA.
- [317] arXiv:2403.17607 [ pdf , ps , html , other ]
-
Title: Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUsKai Yuan , Christoph Bauinger , Xiangyi Zhang , Pascal Baehr , Matthias Kirchhart , Darius Dabert , Adrien Tousnakhoff , Pierre Boudier , Michael PaulitschSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), which targets and is optimized for the Intel Data Center GPU Max 1550. To increase the performance, our implementation minimizes the slow global memory accesses by maximizing the data reuse within the general register file and the shared local memory by fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in the arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor 19. The code can be found at this https URL .
- [318] arXiv:2403.17632 [ pdf , ps , html , other ]
-
Title: Data-driven Energy Consumption Modelling for Electric Micromobility using an Open DatasetComments: 7 pages, 5 figures, 4 tables. This manuscript has been accepted by the IEEE ITEC 2024Subjects: Artificial Intelligence (cs.AI) ; Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: The escalating challenges of traffic congestion and environmental degradation underscore the critical importance of embracing E-Mobility solutions in urban spaces. In particular, micro E-Mobility tools such as E-scooters and E-bikes, play a pivotal role in this transition, offering sustainable alternatives for urban commuters. However, the energy consumption patterns for these tools are a critical aspect that impacts their effectiveness in real-world scenarios and is essential for trip planning and boosting user confidence in using these. To this effect, recent studies have utilised physical models customised for specific mobility tools and conditions, but these models struggle with generalization and effectiveness in real-world scenarios due to a notable absence of open datasets for thorough model evaluation and verification. To fill this gap, our work presents an open dataset, collected in Dublin, Ireland, specifically designed for energy modelling research related to E-Scooters and E-Bikes. Furthermore, we provide a comprehensive analysis of energy consumption modelling based on the dataset using a set of representative machine learning algorithms and compare their performance against the contemporary mathematical models as a baseline. Our results demonstrate a notable advantage for data-driven models in comparison to the corresponding mathematical models for estimating energy consumption. Specifically, data-driven models outperform physical models in accuracy by up to 83.83% for E-Bikes and 82.16% for E-Scooters based on an in-depth analysis of the dataset under certain assumptions.
- [319] arXiv:2403.17643 [ pdf , ps , html , other ]
-
Title: S+t-SNE -- Bringing dimensionality reduction to data streamsComments: This preprint has not undergone peer review or any post-submission improvements or corrections. We will soon add a link to the final version of this contribution that underwent peer-review and post-acceptance improvements and was presented at IDA2024 ( this https URL )Journal-ref: Advances in Intelligent Data Analysis XXII. IDA 2024. Lecture Notes in Computer Science, vol 14642., pp 95-106 (2024). Springer, ChamSubjects: Artificial Intelligence (cs.AI) ; Information Retrieval (cs.IR)
Abstract: We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle infinite data streams. The core idea behind S+t-SNE is to update the t-SNE embedding incrementally as new data arrives, ensuring scalability and adaptability to handle streaming scenarios. By selecting the most important points at each step, the algorithm ensures scalability while keeping informative visualisations. Employing a blind method for drift management adjusts the embedding space, facilitating continuous visualisation of evolving data dynamics. Our experimental evaluations demonstrate the effectiveness and efficiency of S+t-SNE. The results highlight its ability to capture patterns in a streaming scenario. We hope our approach offers researchers and practitioners a real-time tool for understanding and interpreting high-dimensional data.
- [320] arXiv:2403.17653 [ pdf , ps , html , other ]
-
Title: An Extension-based Approach for Computing and Verifying Preferences in Abstract ArgumentationSubjects: Artificial Intelligence (cs.AI)
Abstract: We present an extension-based approach for computing and verifying preferences in an abstract argumentation system. Although numerous argumentation semantics have been developed previously for identifying acceptable sets of arguments from an argumentation framework, there is a lack of justification behind their acceptability based on implicit argument preferences. Preference-based argumentation frameworks allow one to determine what arguments are justified given a set of preferences. Our research considers the inverse of the standard reasoning problem, i.e., given an abstract argumentation framework and a set of justified arguments, we compute what the possible preferences over arguments are. Furthermore, there is a need to verify (i.e., assess) that the computed preferences would lead to the acceptable sets of arguments. This paper presents a novel approach and algorithm for exhaustively computing and enumerating all possible sets of preferences (restricted to three identified cases) for a conflict-free set of arguments in an abstract argumentation framework. We prove the soundness, completeness and termination of the algorithm. The research establishes that preferences are determined using an extension-based approach after the evaluation phase (acceptability of arguments) rather than stated beforehand. In this work, we focus our research study on grounded, preferred and stable semantics. We show that the complexity of computing sets of preferences is exponential in the number of arguments, and thus, describe an approximate approach and algorithm to compute the preferences. Furthermore, we present novel algorithms for verifying (i.e., assessing) the computed preferences. We provide details of the implementation of the algorithms (source code has been made available), various experiments performed to evaluate the algorithms and the analysis of the results.
- [321] arXiv:2403.17683 [ pdf , ps , html , other ]
-
Title: Solution for Emotion Prediction Competition of Workshop on Emotionally and Culturally Intelligent AISubjects: Artificial Intelligence (cs.AI)
Abstract: This report provide a detailed description of the method that we explored and proposed in the WECIA Emotion Prediction Competition (EPC), which predicts a person's emotion through an artistic work with a comment. The dataset of this competition is ArtELingo, designed to encourage work on diversity across languages and cultures. The dataset has two main challenges, namely modal imbalance problem and language-cultural differences problem. In order to address this issue, we propose a simple yet effective approach called single-multi modal with Emotion-Cultural specific prompt(ECSP), which focuses on using the single modal message to enhance the performance of multimodal models and a well-designed prompt to reduce cultural differences problem. To clarify, our approach contains two main blocks: (1)XLM-R\cite{conneau2019unsupervised} based unimodal model and X$^2$-VLM\cite{zeng2022x} based multimodal model (2) Emotion-Cultural specific prompt. Our approach ranked first in the final test with a score of 0.627.
- [322] arXiv:2403.17726 [ pdf , ps , html , other ]
-
Title: Tiny Models are the Computational Saver for Large ModelsSubjects: Artificial Intelligence (cs.AI)
Abstract: This paper introduces TinySaver, an early-exit-like dynamic model compression approach which employs tiny models to substitute large models adaptively. Distinct from traditional compression techniques, dynamic methods like TinySaver can leverage the difficulty differences to allow certain inputs to complete their inference processes early, thereby conserving computational resources. Most existing early exit designs are implemented by attaching additional network branches to the model's backbone. Our study, however, reveals that completely independent tiny models can replace a substantial portion of the larger models' job with minimal impact on performance. Employing them as the first exit can remarkably enhance computational efficiency. By searching and employing the most appropriate tiny model as the computational saver for a given large model, the proposed approaches work as a novel and generic method to model compression. This finding will help the research community in exploring new compression methods to address the escalating computational demands posed by rapidly evolving AI models. Our evaluation of this approach in ImageNet-1k classification demonstrates its potential to reduce the number of compute operations by up to 90%, with only negligible losses in performance, across various modern vision models. The code of this work will be available.
- [323] arXiv:2403.17735 [ pdf , ps , html , other ]
-
Title: Out-of-distribution Rumor Detection via Test-Time AdaptationSubjects: Artificial Intelligence (cs.AI)
Abstract: Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Existing methods for rumor detection have achieved good performance, as they have collected enough corpus from the same data distribution for model training. However, significant distribution shifts between the training data and real-world test data occur due to differences in news topics, social media platforms, languages and the variance in propagation scale caused by news popularity. This leads to a substantial decline in the performance of these existing methods in Out-Of-Distribution (OOD) situations. To address this problem, we propose a simple and efficient method named Test-time Adaptation for Rumor Detection under distribution shifts (TARD). This method models the propagation of news in the form of a propagation graph, and builds propagation graph test-time adaptation framework, enhancing the model's adaptability and robustness when facing OOD problems. Extensive experiments conducted on two group datasets collected from real-world social platforms demonstrate that our framework outperforms the state-of-the-art methods in performance.
- [324] arXiv:2403.17742 [ pdf , ps , html , other ]
-
Title: Using Stratified Sampling to Improve LIME Image ExplanationsSubjects: Artificial Intelligence (cs.AI)
Abstract: We investigate the use of a stratified sampling approach for LIME Image, a popular model-agnostic explainable AI method for computer vision tasks, in order to reduce the artifacts generated by typical Monte Carlo sampling. Such artifacts are due to the undersampling of the dependent variable in the synthetic neighborhood around the image being explained, which may result in inadequate explanations due to the impossibility of fitting a linear regressor on the sampled data. We then highlight a connection with the Shapley theory, where similar arguments about undersampling and sample relevance were suggested in the past. We derive all the formulas and adjustment factors required for an unbiased stratified sampling estimator. Experiments show the efficacy of the proposed approach.
- [325] arXiv:2403.17755 [ pdf , ps , html , other ]
-
Title: DataCook: Crafting Anti-Adversarial Examples for Healthcare Data Copyright ProtectionSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In the realm of healthcare, the challenges of copyright protection and unauthorized third-party misuse are increasingly significant. Traditional methods for data copyright protection are applied prior to data distribution, implying that models trained on these data become uncontrollable. This paper introduces a novel approach, named DataCook, designed to safeguard the copyright of healthcare data during the deployment phase. DataCook operates by "cooking" the raw data before distribution, enabling the development of models that perform normally on this processed data. However, during the deployment phase, the original test data must be also "cooked" through DataCook to ensure normal model performance. This process grants copyright holders control over authorization during the deployment phase. The mechanism behind DataCook is by crafting anti-adversarial examples (AntiAdv), which are designed to enhance model confidence, as opposed to standard adversarial examples (Adv) that aim to confuse models. Similar to Adv, AntiAdv introduces imperceptible perturbations, ensuring that the data processed by DataCook remains easily understandable. We conducted extensive experiments on MedMNIST datasets, encompassing both 2D/3D data and the high-resolution variants. The outcomes indicate that DataCook effectively meets its objectives, preventing models trained on AntiAdv from analyzing unauthorized data effectively, without compromising the validity and accuracy of the data in legitimate scenarios. Code and data are available at this https URL .
- [326] arXiv:2403.17778 [ pdf , ps , html , other ]
-
Title: Towards a FAIR Documentation of Workflows and Models in Applied MathematicsSubjects: Artificial Intelligence (cs.AI) ; Databases (cs.DB); Digital Libraries (cs.DL)
Abstract: Modeling-Simulation-Optimization workflows play a fundamental role in applied mathematics. The Mathematical Research Data Initiative, MaRDI, responded to this by developing a FAIR and machine-interpretable template for a comprehensive documentation of such workflows. MaRDMO, a Plugin for the Research Data Management Organiser, enables scientists from diverse fields to document and publish their workflows on the MaRDI Portal seamlessly using the MaRDI template. Central to these workflows are mathematical models. MaRDI addresses them with the MathModDB ontology, offering a structured formal model description. Here, we showcase the interaction between MaRDMO and the MathModDB Knowledge Graph through an algebraic modeling workflow from the Digital Humanities. This demonstration underscores the versatility of both services beyond their original numerical domain.
- [327] arXiv:2403.17787 [ pdf , ps , html , other ]
-
Title: Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security ApplicationsSubjects: Artificial Intelligence (cs.AI) ; Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), such as Gemini-pro, which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data, integrating both textual and visual information on a scale previously unattainable, opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct tasks: a visually evident task of detecting simple triggers, such as small squares in images, indicative of potential backdoors, and a non-visually evident task of malware classification through visual representations. Our results highlight a significant divergence in performance, with Gemini-pro falling short in accuracy and reliability when compared to fine-tuned ViT models. The ViT models, on the other hand, demonstrate exceptional accuracy, achieving near-perfect performance on both tasks. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.
- [328] arXiv:2403.17814 [ pdf , ps , html , other ]
-
Title: D-PAD: Deep-Shallow Multi-Frequency Patterns Disentangling for Time Series ForecastingSubjects: Artificial Intelligence (cs.AI)
Abstract: In time series forecasting, effectively disentangling intricate temporal patterns is crucial. While recent works endeavor to combine decomposition techniques with deep learning, multiple frequencies may still be mixed in the decomposed components, e.g., trend and seasonal. Furthermore, frequency domain analysis methods, e.g., Fourier and wavelet transforms, have limitations in resolution in the time domain and adaptability. In this paper, we propose D-PAD, a deep-shallow multi-frequency patterns disentangling neural network for time series forecasting. Specifically, a multi-component decomposing (MCD) block is introduced to decompose the series into components with different frequency ranges, corresponding to the "shallow" aspect. A decomposition-reconstruction-decomposition (D-R-D) module is proposed to progressively extract the information of frequencies mixed in the components, corresponding to the "deep" aspect. After that, an interaction and fusion (IF) module is used to further analyze the components. Extensive experiments on seven real-world datasets demonstrate that D-PAD achieves the state-of-the-art performance, outperforming the best baseline by an average of 9.48% and 7.15% in MSE and MAE, respectively.
- [329] arXiv:2403.17826 [ pdf , ps , html , other ]
-
Title: On the Computational Complexity of Stackelberg Planning and Meta-Operator Verification: Technical ReportComments: Presented at ICAPS24Subjects: Artificial Intelligence (cs.AI)
Abstract: Stackelberg planning is a recently introduced single-turn two-player adversarial planning model, where two players are acting in a joint classical planning task, the objective of the first player being hampering the second player from achieving its goal. This places the Stackelberg planning problem somewhere between classical planning and general combinatorial two-player games. But, where exactly? All investigations of Stackelberg planning so far focused on practical aspects. We close this gap by conducting the first theoretical complexity analysis of Stackelberg planning. We show that in general Stackelberg planning is actually no harder than classical planning. Under a polynomial plan-length restriction, however, Stackelberg planning is a level higher up in the polynomial complexity hierarchy, suggesting that compilations into classical planning come with a worst-case exponential plan-length increase. In attempts to identify tractable fragments, we further study its complexity under various planning task restrictions, showing that Stackelberg planning remains intractable where classical planning is not. We finally inspect the complexity of meta-operator verification, a problem that has been recently connected to Stackelberg planning.
- [330] arXiv:2403.17873 [ pdf , ps , html , other ]
-
Title: Addressing Social Misattributions of Large Language Models: An HCXAI-based ApproachComments: Extended version of the manuscript accepted for the ACM CHI Workshop on Human-Centered Explainable AI 2024 (HCXAI24)Subjects: Artificial Intelligence (cs.AI)
Abstract: Human-centered explainable AI (HCXAI) advocates for the integration of social aspects into AI explanations. Central to the HCXAI discourse is the Social Transparency (ST) framework, which aims to make the socio-organizational context of AI systems accessible to their users. In this work, we suggest extending the ST framework to address the risks of social misattributions in Large Language Models (LLMs), particularly in sensitive areas like mental health. In fact LLMs, which are remarkably capable of simulating roles and personas, may lead to mismatches between designers' intentions and users' perceptions of social attributes, risking to promote emotional manipulation and dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To address these issues, we propose enhancing the ST framework with a fifth 'W-question' to clarify the specific social attributions assigned to LLMs by its designers and users. This addition aims to bridge the gap between LLM capabilities and user perceptions, promoting the ethically responsible development and use of LLM-based technology.
- [331] arXiv:2403.17914 [ pdf , ps , html , other ]
-
Title: Hierarchical Multi-label Classification for Fine-level Event Extraction from Aviation Accident ReportsComments: Accepted in INFORMS Journal of Data ScienceSubjects: Artificial Intelligence (cs.AI)
Abstract: A large volume of accident reports is recorded in the aviation domain, which greatly values improving aviation safety. To better use those reports, we need to understand the most important events or impact factors according to the accident reports. However, the increasing number of accident reports requires large efforts from domain experts to label those reports. In order to make the labeling process more efficient, many researchers have started developing algorithms to identify the underlying events from accident reports automatically. This article argues that we can identify the events more accurately by leveraging the event taxonomy. More specifically, we consider the problem a hierarchical classification task where we first identify the coarse-level information and then predict the fine-level information. We achieve this hierarchical classification process by incorporating a novel hierarchical attention module into BERT. To further utilize the information from event taxonomy, we regularize the proposed model according to the relationship and distribution among labels. The effectiveness of our framework is evaluated with the data collected by National Transportation Safety Board (NTSB). It has been shown that fine-level prediction accuracy is highly improved, and the regularization term can be beneficial to the rare event identification problem.
- [332] arXiv:2403.17918 [ pdf , ps , html , other ]
-
Title: AgentStudio: A Toolkit for Building General Virtual AgentsSubjects: Artificial Intelligence (cs.AI)
Abstract: Creating autonomous virtual agents capable of using arbitrary software on any digital device remains a major challenge for artificial intelligence. Two key obstacles hinder progress: insufficient infrastructure for building virtual agents in real-world environments, and the need for in-the-wild evaluation of fundamental agent abilities. To address this, we introduce AgentStudio, an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development. This includes environment setups, data collection, agent evaluation, and visualization. The observation and action spaces are highly generic, supporting both function calling and human-computer interfaces. This versatility is further enhanced by AgentStudio's graphical user interfaces, which allow efficient development of datasets and benchmarks in real-world settings. To illustrate, we introduce a visual grounding dataset and a real-world benchmark suite, both created with our graphical interfaces. Furthermore, we present several actionable insights derived from AgentStudio, e.g., general visual grounding, open-ended tool creation, learning from videos, etc. We have open-sourced the environments, datasets, benchmarks, and interfaces to promote research towards developing general virtual agents for the future.
- [333] arXiv:2403.18056 [ pdf , ps , html , other ]
-
Title: Self-Clustering Hierarchical Multi-Agent Reinforcement Learning with Extensible Cooperation GraphSubjects: Artificial Intelligence (cs.AI)
Abstract: Multi-Agent Reinforcement Learning (MARL) has been successful in solving many cooperative challenges. However, classic non-hierarchical MARL algorithms still cannot address various complex multi-agent problems that require hierarchical cooperative behaviors. The cooperative knowledge and policies learned in non-hierarchical algorithms are implicit and not interpretable, thereby restricting the integration of existing knowledge. This paper proposes a novel hierarchical MARL model called Hierarchical Cooperation Graph Learning (HCGL) for solving general multi-agent problems. HCGL has three components: a dynamic Extensible Cooperation Graph (ECG) for achieving self-clustering cooperation; a group of graph operators for adjusting the topology of ECG; and an MARL optimizer for training these graph operators. HCGL's key distinction from other MARL models is that the behaviors of agents are guided by the topology of ECG instead of policy neural networks. ECG is a three-layer graph consisting of an agent node layer, a cluster node layer, and a target node layer. To manipulate the ECG topology in response to changing environmental conditions, four graph operators are trained to adjust the edge connections of ECG dynamically. The hierarchical feature of ECG provides a unique approach to merge primitive actions (actions executed by the agents) and cooperative actions (actions executed by the clusters) into a unified action space, allowing us to integrate fundamental cooperative knowledge into an extensible interface. In our experiments, the HCGL model has shown outstanding performance in multi-agent benchmarks with sparse rewards. We also verify that HCGL can easily be transferred to large-scale scenarios with high zero-shot transfer success rates.
- [334] arXiv:2403.18057 [ pdf , ps , html , other ]
-
Title: Prioritized League Reinforcement Learning for Large-Scale Heterogeneous Multiagent SystemsSubjects: Artificial Intelligence (cs.AI)
Abstract: Large-scale heterogeneous multiagent systems feature various realistic factors in the real world, such as agents with diverse abilities and overall system cost. In comparison to homogeneous systems, heterogeneous systems offer significant practical advantages. Nonetheless, they also present challenges for multiagent reinforcement learning, including addressing the non-stationary problem and managing an imbalanced number of agents with different types. We propose a Prioritized Heterogeneous League Reinforcement Learning (PHLRL) method to address large-scale heterogeneous cooperation problems. PHLRL maintains a record of various policies that agents have explored during their training and establishes a heterogeneous league consisting of diverse policies to aid in future policy optimization. Furthermore, we design a prioritized policy gradient approach to compensate for the gap caused by differences in the number of different types of agents. Next, we use Unreal Engine to design a large-scale heterogeneous cooperation benchmark named Large-Scale Multiagent Operation (LSMO), which is a complex two-team competition scenario that requires collaboration from both ground and airborne agents. We use experiments to show that PHLRL outperforms state-of-the-art methods, including QTRAN and QPLEX in LSMO.
- [335] arXiv:2403.18100 [ pdf , ps , other ]
-
Title: Driving Intelligent IoT Monitoring and Control through Cloud Computing and Machine LearningSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: This article explores how to drive intelligent iot monitoring and control through cloud computing and machine learning. As iot and the cloud continue to generate large and diverse amounts of data as sensor devices in the network, the collected data is sent to the cloud for statistical analysis, prediction, and data analysis to achieve business objectives. However, because the cloud computing model is limited by distance, it can be problematic in environments where the quality of the Internet connection is not ideal for critical operations. Therefore, edge computing, as a distributed computing architecture, moves the location of processing applications, data and services from the central node of the network to the logical edge node of the network to reduce the dependence on cloud processing and analysis of data, and achieve near-end data processing and analysis. The combination of iot and edge computing can reduce latency, improve efficiency, and enhance security, thereby driving the development of intelligent systems. The paper also introduces the development of iot monitoring and control technology, the application of edge computing in iot monitoring and control, and the role of machine learning in data analysis and fault detection. Finally, the application and effect of intelligent Internet of Things monitoring and control system in industry, agriculture, medical and other fields are demonstrated through practical cases and experimental studies.
- [336] arXiv:2403.18101 [ pdf , ps , html , other ]
-
Title: Towards Explainable Clustering: A Constrained Declarative based ApproachSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: The domain of explainable AI is of interest in all Machine Learning fields, and it is all the more important in clustering, an unsupervised task whose result must be validated by a domain expert. We aim at finding a clustering that has high quality in terms of classic clustering criteria and that is explainable, and we argue that these two dimensions must be considered when building the clustering. We consider that a good global explanation of a clustering should give the characteristics of each cluster taking into account their abilities to describe its objects (coverage) while distinguishing it from the other clusters (discrimination). Furthermore, we aim at leveraging expert knowledge, at different levels, on the structure of the expected clustering or on its explanations. In our framework an explanation of a cluster is a set of patterns, and we propose a novel interpretable constrained clustering method called ECS for declarative clustering with Explainabilty-driven Cluster Selection that integrates structural or domain expert knowledge expressed by means of constraints. It is based on the notion of coverage and discrimination that are formalized at different levels (cluster / clustering), each allowing for exceptions through parameterized thresholds. Our method relies on four steps: generation of a set of partitions, computation of frequent patterns for each cluster, pruning clusters that violates some constraints, and selection of clusters and associated patterns to build an interpretable clustering. This last step is combinatorial and we have developed a Constraint-Programming (CP) model to solve it. The method can integrate prior knowledge in the form of user constraints, both before or in the CP model.
- [337] arXiv:2403.18120 [ pdf , ps , html , other ]
-
Title: Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with AutoformalizationComments: ICLR 2024Subjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Large language models (LLM), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g. in Isabelle, a formal theorem proving environment), they can be prompted to translate i.e. autoformalize informal mathematical statements into formal Isabelle code -- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting -- the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes. The code can be found at this https URL .
- [338] arXiv:2403.18145 [ pdf , ps , html , other ]
-
Title: A Real-Time Rescheduling Algorithm for Multi-robot Plan ExecutionComments: ICAPS 2024Subjects: Artificial Intelligence (cs.AI) ; Multiagent Systems (cs.MA); Robotics (cs.RO)
Abstract: One area of research in multi-agent path finding is to determine how replanning can be efficiently achieved in the case of agents being delayed during execution. One option is to reschedule the passing order of agents, i.e., the sequence in which agents visit the same location. In response, we propose Switchable-Edge Search (SES), an A*-style algorithm designed to find optimal passing orders. We prove the optimality of SES and evaluate its efficiency via simulations. The best variant of SES takes less than 1 second for small- and medium-sized problems and runs up to 4 times faster than baselines for large-sized problems.
- [339] arXiv:2403.18183 [ pdf , ps , html , other ]
-
Title: Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction ConfidenceSubjects: Artificial Intelligence (cs.AI) ; Information Retrieval (cs.IR)
Abstract: A well-designed document communicates not only through its words but also through its visual eloquence. Authors utilize aesthetic elements such as colors, fonts, graphics, and layouts to shape the perception of information. Thoughtful document design, informed by psychological insights, enhances both the visual appeal and the comprehension of the content. While state-of-the-art document AI models demonstrate the benefits of incorporating layout and image data, it remains unclear whether the nuances of document aesthetics are effectively captured. To bridge the gap between human cognition and AI interpretation of aesthetic elements, we formulated hypotheses concerning AI behavior in document understanding tasks, specifically anchored in document design principles. With a focus on legibility and layout quality, we tested four aspects of aesthetic effects: noise, font-size contrast, alignment, and complexity, on model confidence using correlational analysis. The results and observations highlight the value of model analysis rooted in document design theories. Our work serves as a trailhead for further studies and we advocate for continued research in this topic to deepen our understanding of how AI interprets document aesthetics.
- [340] arXiv:2403.18203 [ pdf , ps , html , other ]
-
Title: EndToEndML: An Open-Source End-to-End Pipeline for Machine Learning ApplicationsComments: 2024 7th International Conference on Information and Computer Technologies (ICICT)Subjects: Artificial Intelligence (cs.AI)
Abstract: Artificial intelligence (AI) techniques are widely applied in the life sciences. However, applying innovative AI techniques to understand and deconvolute biological complexity is hindered by the learning curve for life science scientists to understand and use computing languages. An open-source, user-friendly interface for AI models, that does not require programming skills to analyze complex biological data will be extremely valuable to the bioinformatics community. With easy access to different sequencing technologies and increased interest in different 'omics' studies, the number of biological datasets being generated has increased and analyzing these high-throughput datasets is computationally demanding. The majority of AI libraries today require advanced programming skills as well as machine learning, data preprocessing, and visualization skills. In this research, we propose a web-based end-to-end pipeline that is capable of preprocessing, training, evaluating, and visualizing machine learning (ML) models without manual intervention or coding expertise. By integrating traditional machine learning and deep neural network models with visualizations, our library assists in recognizing, classifying, clustering, and predicting a wide range of multi-modal, multi-sensor datasets, including images, languages, and one-dimensional numerical data, for drug discovery, pathogen classification, and medical diagnostics.
- [341] arXiv:2403.18205 [ pdf , ps , html , other ]
-
Title: Exploring the Privacy Protection Capabilities of Chinese Large Language ModelsComments: 11 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs), renowned for their impressive capabilities in various tasks, have significantly advanced artificial intelligence. Yet, these advancements have raised growing concerns about privacy and security implications. To address these issues and explain the risks inherent in these models, we have devised a three-tiered progressive framework tailored for evaluating privacy in language systems. This framework consists of progressively complex and in-depth privacy test tasks at each tier. Our primary objective is to comprehensively evaluate the sensitivity of large language models to private information, examining how effectively they discern, manage, and safeguard sensitive data in diverse scenarios. This systematic evaluation helps us understand the degree to which these models comply with privacy protection guidelines and the effectiveness of their inherent safeguards against privacy breaches. Our observations indicate that existing Chinese large language models universally show privacy protection shortcomings. It seems that at the moment this widespread issue is unavoidable and may pose corresponding privacy risks in applications based on these models.
- [342] arXiv:2403.18218 [ pdf , ps , html , other ]
-
Title: Leveraging Large Language Models for Fuzzy String Matching in Political ScienceComments: 7 pages, 2 figures, 1 table;Subjects: Artificial Intelligence (cs.AI)
Abstract: Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity with different names such as ''JP Morgan'' and ''Chase Bank'', ''DPRK'' and ''North Korea'', ''Chuck Fleischmann (R)'' and ''Charles Fleischmann (R)''. In this letter, we propose to use large language models to entirely sidestep this problem in an easy and intuitive manner. Extensive experiments show that our proposed methods can improve the state of the art by as much as 39% in terms of average precision while being substantially easier and more intuitive to use by political scientists. Moreover, our results are robust against various temperatures. We further note that enhanced prompting can lead to additional performance improvements.
- [343] arXiv:2403.18230 [ pdf , ps , html , other ]
-
Title: Large Language Models Need Consultants for Reasoning: Becoming an Expert in a Complex Human System Through Behavior SimulationSubjects: Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs), in conjunction with various reasoning reinforcement methodologies, have demonstrated remarkable capabilities comparable to humans in fields such as mathematics, law, coding, common sense, and world knowledge. In this paper, we delve into the reasoning abilities of LLMs within complex human systems. We propose a novel reasoning framework, termed ``Mosaic Expert Observation Wall'' (MEOW) exploiting generative-agents-based simulation technique. In the MEOW framework, simulated data are utilized to train an expert model concentrating ``experience'' about a specific task in each independent time of simulation. It is the accumulated ``experience'' through the simulation that makes for an expert on a task in a complex human system. We conduct the experiments within a communication game that mirrors real-world security scenarios. The results indicate that our proposed methodology can cooperate with existing methodologies to enhance the reasoning abilities of LLMs in complex human systems.
- [344] arXiv:2403.18243 [ pdf , ps , html , other ]
-
Title: Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-CheckSubjects: Artificial Intelligence (cs.AI)
Abstract: Retrieval-Augmented Generation (RAG) aims to generate more reliable and accurate responses, by augmenting large language models (LLMs) with the external vast and dynamic knowledge. Most previous work focuses on using RAG for single-round question answering, while how to adapt RAG to the complex conversational setting wherein the question is interdependent on the preceding context is not well studied. In this paper, we propose a conversation-level RAG approach, which incorporates fine-grained retrieval augmentation and self-check for conversational question answering (CQA). In particular, our approach consists of three components, namely conversational question refiner, fine-grained retriever and self-check based response generator, which work collaboratively for question understanding and relevant information acquisition in conversational settings. Extensive experiments demonstrate the great advantages of our approach over the state-of-the-art baselines. Moreover, we also release a Chinese CQA dataset with new features including reformulated question, extracted keyword, retrieved paragraphs and their helpfulness, which facilitates further researches in RAG enhanced CQA.
- [345] arXiv:2403.18278 [ pdf , ps , html , other ]
-
Title: Identification and Uses of Deep Learning Backbones via Pattern MiningComments: 9 pages, 6 figures, published SIAM SDM24Subjects: Artificial Intelligence (cs.AI)
Abstract: Deep learning is extensively used in many areas of data mining as a black-box method with impressive results. However, understanding the core mechanism of how deep learning makes predictions is a relatively understudied problem. Here we explore the notion of identifying a backbone of deep learning for a given group of instances. A group here can be instances of the same class or even misclassified instances of the same class. We view each instance for a given group as activating a subset of neurons and attempt to find a subgraph of neurons associated with a given concept/group. We formulate this problem as a set cover style problem and show it is intractable and presents a highly constrained integer linear programming (ILP) formulation. As an alternative, we explore a coverage-based heuristic approach related to pattern mining, and show it converges to a Pareto equilibrium point of the ILP formulation. Experimentally we explore these backbones to identify mistakes and improve performance, explanation, and visualization. We demonstrate application-based results using several challenging data sets, including Bird Audio Detection (BAD) Challenge and Labeled Faces in the Wild (LFW), as well as the classic MNIST data.
- [346] arXiv:2403.18338 [ pdf , ps , html , other ]
-
Title: mALBERT: Is a Compact Multilingual BERT Model Still Worth It?Comments: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, May 2024, Torino, ItalySubjects: Artificial Intelligence (cs.AI)
Abstract: Within the current trend of Pretained Language Models (PLM), emerge more and more criticisms about the ethical andecological impact of such models. In this article, considering these critical remarks, we propose to focus on smallermodels, such as compact models like ALBERT, which are more ecologically virtuous than these PLM. However,PLMs enable huge breakthroughs in Natural Language Processing tasks, such as Spoken and Natural LanguageUnderstanding, classification, Question--Answering tasks. PLMs also have the advantage of being multilingual, and,as far as we know, a multilingual version of compact ALBERT models does not exist. Considering these facts, wepropose the free release of the first version of a multilingual compact ALBERT model, pre-trained using Wikipediadata, which complies with the ethical aspect of such a language model. We also evaluate the model against classicalmultilingual PLMs in classical NLP tasks. Finally, this paper proposes a rare study on the subword tokenizationimpact on language performances.
- [347] arXiv:2403.18344 [ pdf , ps , html , other ]
-
Title: LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language ModelsMingxing Peng , Xusen Guo , Xianda Chen , Meixin Zhu , Kehua Chen , Hao (Frank) Yang , Xuesong Wang , Yinhai WangSubjects: Artificial Intelligence (cs.AI)
Abstract: To ensure safe driving in dynamic environments, autonomous vehicles should possess the capability to accurately predict the lane change intentions of surrounding vehicles in advance and forecast their future trajectories. Existing motion prediction approaches have ample room for improvement, particularly in terms of long-term prediction accuracy and interpretability. In this paper, we address these challenges by proposing LC-LLM, an explainable lane change prediction model that leverages the strong reasoning capabilities and self-explanation abilities of Large Language Models (LLMs). Essentially, we reformulate the lane change prediction task as a language modeling problem, processing heterogeneous driving scenario information in natural language as prompts for input into the LLM and employing a supervised fine-tuning technique to tailor the LLM specifically for our lane change prediction task. This allows us to utilize the LLM's powerful common sense reasoning abilities to understand complex interactive information, thereby improving the accuracy of long-term predictions. Furthermore, we incorporate explanatory requirements into the prompts in the inference stage. Therefore, our LC-LLM model not only can predict lane change intentions and trajectories but also provides explanations for its predictions, enhancing the interpretability. Extensive experiments on the large-scale highD dataset demonstrate the superior performance and interpretability of our LC-LLM in lane change prediction task. To the best of our knowledge, this is the first attempt to utilize LLMs for predicting lane change behavior. Our study shows that LLMs can encode comprehensive interaction information for driving behavior understanding.
- [348] arXiv:2403.18388 [ pdf , ps , html , other ]
-
Title: FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN ConversionSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV)
Abstract: Spiking Neural Networks (SNNs) offer a promising avenue for energy-efficient computing compared with Artificial Neural Networks (ANNs), closely mirroring biological neural processes. However, this potential comes with inherent challenges in directly training SNNs through spatio-temporal backpropagation -- stemming from the temporal dynamics of spiking neurons and their discrete signal processing -- which necessitates alternative ways of training, most notably through ANN-SNN conversion. In this work, we introduce a lightweight Forward Temporal Bias Correction (FTBC) technique, aimed at enhancing conversion accuracy without the computational overhead. We ground our method on provided theoretical findings that through proper temporal bias calibration the expected error of ANN-SNN conversion can be reduced to be zero after each time step. We further propose a heuristic algorithm for finding the temporal bias only in the forward pass, thus eliminating the computational burden of backpropagation and we evaluate our method on CIFAR-10/100 and ImageNet datasets, achieving a notable increase in accuracy on all datasets. Codes are released at a GitHub repository.
- [349] arXiv:2403.18405 [ pdf , ps , html , other ]
-
Title: Leveraging Large Language Models for Relevance Judgments in Legal Case RetrievalSubjects: Artificial Intelligence (cs.AI)
Abstract: Collecting relevant judgments for legal case retrieval is a challenging and time-consuming task. Accurately judging the relevance between two legal cases requires a considerable effort to read the lengthy text and a high level of domain expertise to extract Legal Facts and make juridical judgments. With the advent of advanced large language models, some recent studies have suggested that it is promising to use LLMs for relevance judgment. Nonetheless, the method of employing a general large language model for reliable relevance judgments in legal case retrieval is yet to be thoroughly explored. To fill this research gap, we devise a novel few-shot workflow tailored to the relevant judgment of legal cases. The proposed workflow breaks down the annotation process into a series of stages, imitating the process employed by human annotators and enabling a flexible integration of expert reasoning to enhance the accuracy of relevance judgments. By comparing the relevance judgments of LLMs and human experts, we empirically show that we can obtain reliable relevance judgments with the proposed workflow. Furthermore, we demonstrate the capacity to augment existing legal case retrieval models through the synthesis of data generated by the large language model.
- [350] arXiv:2403.18489 [ pdf , ps , other ]
-
Title: Impact of Employing Weather Forecast Data as Input to the Estimation of Evapotranspiration by Deep Neural Network ModelsComments: A partial version of the work submitted to ESRE/INTERNATIONAL CONFERENCE ON ENVIRONMENTAL SCIENCES AND RENEWABLE ENERGYSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: Reference Evapotranspiration (ET0) is a key parameter for designing smart irrigation scheduling, since it is related by a coefficient to the water needs of a crop. The United Nations Food and Agriculture Organization, proposed a standard method for ET0 computation (FAO56PM), based on the parameterization of the Penman-Monteith equation, that is widely adopted in the literature. To compute ET0 using the FAO56-PM method, four main weather parameters are needed: temperature, humidity, wind, and solar radiation (SR). One way to make daily ET0 estimations for future days is to use freely available weather forecast services (WFSs), where many meteorological parameters are estimated up to the next 15 days. A problem with this method is that currently, SR is not provided as a free forecast parameter on most of those online services or, normally, such forecasts present a financial cost penalty. For this reason, several ET0 estimation models using machine and deep learning were developed and presented in the literature, that use as input features a reduced set of carefully selected weather parameters, that are compatible with common freely available WFSs. However, most studies on this topic have only evaluated model performance using data from weather stations (WSs), without considering the effect of using weather forecast data. In this study, the performance of authors' previous models is evaluated when using weather forecast data from two online WFSs, in the following scenarios: (i) direct ET0 estimation by an ANN model, and (ii) estimate SR by ANN model, and then use that estimation for ET0 computation, using the FAO56-PM method. Employing data collected from two WFSs and a WS located in Vale do Lobo, Portugal, the latter approach achieved the best result, with a coefficient of determination (R2) ranging between 0.893 and 0.667, when considering forecasts up to 15 days.
- [351] arXiv:2403.18537 [ pdf , ps , other ]
-
Title: A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networksAxel Constant , Hannes Westermann , Bryan Wilson , Alex Kiefer , Ines Hipolito , Sylvain Pronovost , Steven Swanson , Mahault Albarracin , Maxwell J.D. RamsteadSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Abstract: Legal autonomy - the lawful activity of artificial intelligence agents - can be achieved in one of two ways. It can be achieved either by imposing constraints on AI actors such as developers, deployers and users, and on AI resources such as data, or by imposing constraints on the range and scope of the impact that AI agents can have on the environment. The latter approach involves encoding extant rules concerning AI driven devices into the software of AI agents controlling those devices (e.g., encoding rules about limitations on zones of operations into the agent software of an autonomous drone device). This is a challenge since the effectivity of such an approach requires a method of extracting, loading, transforming and computing legal information that would be both explainable and legally interoperable, and that would enable AI agents to reason about the law. In this paper, we sketch a proof of principle for such a method using large language models (LLMs), expert legal systems known as legal decision paths, and Bayesian networks. We then show how the proposed method could be applied to extant regulation in matters of autonomous cars, such as the California Vehicle Code.
- [352] arXiv:2403.18547 [ pdf , ps , html , other ]
-
Title: Neural Architecture Search for Sentence Classification with BERTSubjects: Artificial Intelligence (cs.AI)
Abstract: Pre training of language models on large text corpora is common practice in Natural Language Processing. Following, fine tuning of these models is performed to achieve the best results on a variety of tasks. In this paper we question the common practice of only adding a single output layer as a classification head on top of the network. We perform an AutoML search to find architectures that outperform the current single layer at only a small compute cost. We validate our classification architecture on a variety of NLP benchmarks from the GLUE dataset.
- [353] arXiv:2403.18659 [ pdf , ps , html , other ]
-
Title: INEXA: Interactive and Explainable Process Model Abstraction Through Object-Centric Process MiningSubjects: Artificial Intelligence (cs.AI)
Abstract: Process events are recorded by multiple information systems at different granularity levels. Based on the resulting event logs, process models are discovered at different granularity levels, as well. Events stored at a fine-grained granularity level, for example, may hinder the discovered process model to be displayed due the high number of resulting model elements. The discovered process model of a real-world manufacturing process, for example, consists of 1,489 model elements and over 2,000 arcs. Existing process model abstraction techniques could help reducing the size of the model, but would disconnect it from the underlying event log. Existing event abstraction techniques do neither support the analysis of mixed granularity levels, nor interactive exploration of a suitable granularity level. To enable the exploration of discovered process models at different granularity levels, we propose INEXA, an interactive, explainable process model abstraction method that keeps the link to the event log. As a starting point, INEXA aggregates large process models to a "displayable" size, e.g., for the manufacturing use case to a process model with 58 model elements. Then, the process analyst can explore granularity levels interactively, while applied abstractions are automatically traced in the event log for explainability.
- [354] arXiv:2403.18725 [ pdf , ps , html , other ]
-
Title: Probabilistic Model Checking of Stochastic Reinforcement Learning PoliciesSubjects: Artificial Intelligence (cs.AI)
Abstract: We introduce a method to verify stochastic reinforcement learning (RL) policies. This approach is compatible with any RL algorithm as long as the algorithm and its corresponding environment collectively adhere to the Markov property. In this setting, the future state of the environment should depend solely on its current state and the action executed, independent of any previous states or actions. Our method integrates a verification technique, referred to as model checking, with RL, leveraging a Markov decision process, a trained RL policy, and a probabilistic computation tree logic (PCTL) formula to build a formal model that can be subsequently verified via the model checker Storm. We demonstrate our method's applicability across multiple benchmarks, comparing it to baseline methods called deterministic safety estimates and naive monolithic model checking. Our results show that our method is suited to verify stochastic RL policies.
- [355] arXiv:2403.18731 [ pdf , ps , html , other ]
-
Title: Enhancing Manufacturing Quality Prediction Models through the Integration of Explainability MethodsSubjects: Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: This research presents a method that utilizes explainability techniques to amplify the performance of machine learning (ML) models in forecasting the quality of milling processes, as demonstrated in this paper through a manufacturing use case. The methodology entails the initial training of ML models, followed by a fine-tuning phase where irrelevant features identified through explainability methods are eliminated. This procedural refinement results in performance enhancements, paving the way for potential reductions in manufacturing costs and a better understanding of the trained ML models. This study highlights the usefulness of explainability techniques in both explaining and optimizing predictive models in the manufacturing realm.
- [356] arXiv:2403.18827 [ pdf , ps , html , other ]
-
Title: Bridging Generative Networks with the Common Model of CognitionRobert L. West , Spencer Eckler , Brendan Conway-Smith , Nico Turcas , Eilene Tomkins-Flanagan , Mary Alexandria KellySubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Abstract: This article presents a theoretical framework for adapting the Common Model of Cognition to large generative network models within the field of artificial intelligence. This can be accomplished by restructuring modules within the Common Model into shadow production systems that are peripheral to a central production system, which handles higher-level reasoning based on the shadow productions' output. Implementing this novel structure within the Common Model allows for a seamless connection between cognitive architectures and generative neural networks.
- [357] arXiv:2403.19760 [ pdf , ps , html , other ]
-
Title: Leveraging Counterfactual Paths for Contrastive Explanations of POMDP PoliciesComments: 5 pages, 1 figureSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Abstract: As humans come to rely on autonomous systems more, ensuring the transparency of such systems is important to their continued adoption. Explainable Artificial Intelligence (XAI) aims to reduce confusion and foster trust in systems by providing explanations of agent behavior. Partially observable Markov decision processes (POMDPs) provide a flexible framework capable of reasoning over transition and state uncertainty, while also being amenable to explanation. This work investigates the use of user-provided counterfactuals to generate contrastive explanations of POMDP policies. Feature expectations are used as a means of contrasting the performance of these policies. We demonstrate our approach in a Search and Rescue (SAR) setting. We analyze and discuss the associated challenges through two case studies.
- [358] arXiv:2403.19790 [ pdf , ps , html , other ]
-
Title: Bespoke Large Language Models for Digital Triage Assistance in Mental Health CareSubjects: Artificial Intelligence (cs.AI)
Abstract: Contemporary large language models (LLMs) may have utility for processing unstructured, narrative free-text clinical data contained in electronic health records (EHRs) -- a particularly important use-case for mental health where a majority of routinely-collected patient data lacks structured, machine-readable content.
A significant problem for the the United Kingdom's National Health Service (NHS) are the long waiting lists for specialist mental healthcare. According to NHS data, in each month of 2023, there were between 370,000 and 470,000 individual new referrals into secondary mental healthcare services. Referrals must be triaged by clinicians, using clinical information contained in the patient's EHR to arrive at a decision about the most appropriate mental healthcare team to assess and potentially treat these patients.
The ability to efficiently recommend a relevant team by ingesting potentially voluminous clinical notes could help services both reduce referral waiting times and with the right technology, improve the evidence available to justify triage decisions.
We present and evaluate three different approaches for LLM-based, end-to-end ingestion of variable-length clinical EHR data to assist clinicians when triaging referrals. Our model is able to deliver triage recommendations consistent with existing clinical practices and it's architecture was implemented on a single GPU, making it practical for implementation in resource-limited NHS environments where private implementations of LLM technology will be necessary to ensure confidential clinical data is appropriately controlled and governed. - [359] arXiv:2403.19826 [ pdf , ps , html , other ]
-
Title: Segmentation Re-thinking Uncertainty Estimation Metrics for Semantic SegmentationComments: Premature Submission: accidentally submitted before it was readySubjects: Artificial Intelligence (cs.AI)
Abstract: In the domain of computer vision, semantic segmentation emerges as a fundamental application within machine learning, wherein individual pixels of an image are classified into distinct semantic categories. This task transcends traditional accuracy metrics by incorporating uncertainty quantification, a critical measure for assessing the reliability of each segmentation prediction. Such quantification is instrumental in facilitating informed decision-making, particularly in applications where precision is paramount. Within this nuanced framework, the metric known as PAvPU (Patch Accuracy versus Patch Uncertainty) has been developed as a specialized tool for evaluating entropy-based uncertainty in image segmentation tasks. However, our investigation identifies three core deficiencies within the PAvPU framework and proposes robust solutions aimed at refining the metric. By addressing these issues, we aim to enhance the reliability and applicability of uncertainty quantification, especially in scenarios that demand high levels of safety and accuracy, thus contributing to the advancement of semantic segmentation methodologies in critical applications.
- [360] arXiv:2403.19856 [ pdf , ps , html , other ]
-
Title: Towards a Brazilian History Knowledge GraphSubjects: Artificial Intelligence (cs.AI) ; Digital Libraries (cs.DL)
Abstract: This short paper describes the first steps in a project to construct a knowledge graph for Brazilian history based on the Brazilian Dictionary of Historical Biographies (DHBB) and Wikipedia/Wikidata. We contend that large repositories of Brazilian-named entities (people, places, organizations, and political events and movements) would be beneficial for extracting information from Portuguese texts. We show that many of the terms/entities described in the DHBB do not have corresponding concepts (or Q items) in Wikidata, the largest structured database of entities associated with Wikipedia. We describe previous work on extracting information from the DHBB and outline the steps to construct a Wikidata-based historical knowledge graph.
- [361] arXiv:2403.19857 [ pdf , ps , html , other ]
-
Title: LLMSense: Harnessing LLMs for High-level Reasoning Over Spatiotemporal Sensor TracesComments: 6 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: Most studies on machine learning in sensing systems focus on low-level perception tasks that process raw sensory data within a short time window. However, many practical applications, such as human routine modeling and occupancy tracking, require high-level reasoning abilities to comprehend concepts and make inferences based on long-term sensor traces. Existing machine learning-based approaches for handling such complex tasks struggle to generalize due to the limited training samples and the high dimensionality of sensor traces, necessitating the integration of human knowledge for designing first-principle models or logic reasoning methods. We pose a fundamental question: Can we harness the reasoning capabilities and world knowledge of Large Language Models (LLMs) to recognize complex events from long-term spatiotemporal sensor traces? To answer this question, we design an effective prompting framework for LLMs on high-level reasoning tasks, which can handle traces from the raw sensor data as well as the low-level perception results. We also design two strategies to enhance performance with long sensor traces, including summarization before reasoning and selective inclusion of historical traces. Our framework can be implemented in an edge-cloud setup, running small LLMs on the edge for data summarization and performing high-level reasoning on the cloud for privacy preservation. The results show that LLMSense can achieve over 80\% accuracy on two high-level reasoning tasks such as dementia diagnosis with behavior traces and occupancy tracking with environmental sensor traces. This paper provides a few insights and guidelines for leveraging LLM for high-level reasoning on sensor traces and highlights several directions for future work.
- [362] arXiv:2403.19881 [ pdf , ps , html , other ]
-
Title: IME: Integrating Multi-curvature Shared and Specific Embedding for Temporal Knowledge Graph CompletionSubjects: Artificial Intelligence (cs.AI)
Abstract: Temporal Knowledge Graphs (TKGs) incorporate a temporal dimension, allowing for a precise capture of the evolution of knowledge and reflecting the dynamic nature of the real world. Typically, TKGs contain complex geometric structures, with various geometric structures interwoven. However, existing Temporal Knowledge Graph Completion (TKGC) methods either model TKGs in a single space or neglect the heterogeneity of different curvature spaces, thus constraining their capacity to capture these intricate geometric structures. In this paper, we propose a novel Integrating Multi-curvature shared and specific Embedding (IME) model for TKGC tasks. Concretely, IME models TKGs into multi-curvature spaces, including hyperspherical, hyperbolic, and Euclidean spaces. Subsequently, IME incorporates two key properties, namely space-shared property and space-specific property. The space-shared property facilitates the learning of commonalities across different curvature spaces and alleviates the spatial gap caused by the heterogeneous nature of multi-curvature spaces, while the space-specific property captures characteristic features. Meanwhile, IME proposes an Adjustable Multi-curvature Pooling (AMP) approach to effectively retain important information. Furthermore, IME innovatively designs similarity, difference, and structure loss functions to attain the stated objective. Experimental results clearly demonstrate the superior performance of IME over existing state-of-the-art TKGC models.
- [363] arXiv:2403.19883 [ pdf , ps , html , other ]
-
Title: Policy-Space Search: Equivalences, Improvements, and CompressionSubjects: Artificial Intelligence (cs.AI)
Abstract: Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. A* with Non-Determinism (AND*) (Messa and Pereira, 2023) is a FOND planner that generalizes A* (Hart et al., 1968) for FOND planning. It searches for a solution policy by performing an explicit heuristic search on the policy space of the FOND task. In this paper, we study and improve the performance of the policy-space search performed by AND*. We present a polynomial-time procedure that constructs a solution policy given just the set of states that should be mapped. This procedure, together with a better understanding of the structure of FOND policies, allows us to present three concepts of equivalences between policies. We use policy equivalences to prune part of the policy search space, making AND* substantially more effective in solving FOND tasks. We also study the impact of taking into account structural state-space symmetries to strengthen the detection of equivalence policies and the impact of performing the search with satisficing techniques. We apply a recent technique from the group theory literature to better compute structural state-space symmetries. Finally, we present a solution compressor that, given a policy defined over complete states, finds a policy that unambiguously represents it using the minimum number of partial states. AND* with the introduced techniques generates, on average, two orders of magnitude fewer policies to solve FOND tasks. These techniques allow explicit policy-space search to be competitive in terms of both coverage and solution compactness with other state-of-the-art FOND planners.
- [364] arXiv:2403.19941 [ pdf , ps , html , other ]
-
Title: Diverse Feature Learning by Self-distillation and ResetComments: 15 pages, 6 FiguresSubjects: Artificial Intelligence (cs.AI)
Abstract: Our paper addresses the problem of models struggling to learn diverse features, due to either forgetting previously learned features or failing to learn new ones. To overcome this problem, we introduce Diverse Feature Learning (DFL), a method that combines an important feature preservation algorithm with a new feature learning algorithm. Specifically, for preserving important features, we utilize self-distillation in ensemble models by selecting the meaningful model weights observed during training. For learning new features, we employ reset that involves periodically re-initializing part of the model. As a result, through experiments with various models on the image classification, we have identified the potential for synergistic effects between self-distillation and reset.
- [365] arXiv:2403.19992 [ pdf , ps , html , other ]
-
Title: MindArm: Mechanized Intelligent Non-Invasive Neuro-Driven Prosthetic Arm SystemComments: 8 pages, 21 figures, paper submitted to IROS 24, authors affiliated to NYUADSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Abstract: Currently, people with disability or difficulty to move their arms (referred to as "patients") have very limited technological solutions to efficiently address their physiological limitations. It is mainly due to two reasons: (1) the non-invasive solutions like mind-controlled prosthetic devices are typically very costly and require expensive maintenance; and (2) other solutions require costly invasive brain surgery, which is high risk to perform, expensive, and difficult to maintain. Therefore, current technological solutions are not accessible for all patients with different financial backgrounds. Toward this, we propose a low-cost technological solution called MindArm, a mechanized intelligent non-invasive neuro-driven prosthetic arm system. Our MindArm system employs a deep neural network (DNN) engine to translate brain signals into the intended prosthetic arm motion, thereby helping patients to perform many activities despite their physiological limitations. Here, our MindArm system utilizes widely accessible and low-cost surface electroencephalogram (EEG) electrodes coupled with an Open Brain Computer Interface and UDP networking for acquiring brain signals and transmitting them to the compute module for signal processing. In the compute module, we run a trained DNN model to interpret normalized micro-voltage of the brain signals, and then translate them into a prosthetic arm action via serial communication seamlessly. The experimental results on a fully working prototype demonstrate that, from the three defined actions, our MindArm system achieves positive success rates, i.e., 91\% for idle/stationary, 85\% for shake hand, and 84\% for pick-up cup. This demonstrates that our MindArm provides a novel approach for an alternate low-cost mind-controlled prosthetic devices for all patients.
- [366] arXiv:2403.19995 [ pdf , ps , html , other ]
-
Title: Development of Compositionality and Generalization through Interactive Learning of Language and Action of RobotsComments: 59 pages, 6 figures, 10 supplementary figuresSubjects: Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL); Robotics (cs.RO)
Abstract: Humans excel at applying learned behavior to unlearned situations. A crucial component of this generalization behavior is our ability to compose/decompose a whole into reusable parts, an attribute known as compositionality. One of the fundamental questions in robotics concerns this characteristic. "How can linguistic compositionality be developed concomitantly with sensorimotor skills through associative learning, particularly when individuals only learn partial linguistic compositions and their corresponding sensorimotor patterns?" To address this question, we propose a brain-inspired neural network model that integrates vision, proprioception, and language into a framework of predictive coding and active inference, based on the free-energy principle. The effectiveness and capabilities of this model were assessed through various simulation experiments conducted with a robot arm. Our results show that generalization in learning to unlearned verb-noun compositions, is significantly enhanced when training variations of task composition are increased. We attribute this to self-organized compositional structures in linguistic latent state space being influenced significantly by sensorimotor learning. Ablation studies show that visual attention and working memory are essential to accurately generate visuo-motor sequences to achieve linguistically represented goals. These insights advance our understanding of mechanisms underlying development of compositionality through interactions of linguistic and sensorimotor experience.
- [367] arXiv:2403.20089 [ pdf , ps , html , other ]
-
Title: Implications of the AI Act for Non-Discrimination Law and Algorithmic FairnessSubjects: Artificial Intelligence (cs.AI)
Abstract: The topic of fairness in AI, as debated in the FATE (Fairness, Accountability, Transparency, and Ethics in AI) communities, has sparked meaningful discussions in the past years. However, from a legal perspective, particularly from European Union law, many open questions remain. Whereas algorithmic fairness aims to mitigate structural inequalities at the design level, European non-discrimination law is tailored to individual cases of discrimination after an AI model has been deployed. The AI Act might present a tremendous step towards bridging these two concepts by shifting non-discrimination responsibilities into the design stage of AI models. Based on an integrative reading of the AI Act, we comment on legal as well as technical enforcement problems and propose practical implications on bias detection and bias correction in order to specify and comply with specific technical requirements.
- [368] arXiv:2403.20097 [ pdf , ps , html , other ]
-
Title: ITCMA: A Generative Agent Based on a Computational Consciousness StructureComments: 20 pages, 11 figuresSubjects: Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
Abstract: Large Language Models (LLMs) still face challenges in tasks requiring understanding implicit instructions and applying common-sense knowledge. In such scenarios, LLMs may require multiple attempts to achieve human-level performance, potentially leading to inaccurate responses or inferences in practical environments, affecting their long-term consistency and behavior. This paper introduces the Internal Time-Consciousness Machine (ITCM), a computational consciousness structure. We further propose the ITCM-based Agent (ITCMA), which supports behavior generation and reasoning in open-world settings. ITCMA enhances LLMs' ability to understand implicit instructions and apply common-sense knowledge by considering agents' interaction and reasoning with the environment. Evaluations in the Alfworld environment show that trained ITCMA outperforms the state-of-the-art (SOTA) by 9% on the seen set. Even untrained ITCMA achieves a 96% task completion rate on the seen set, 5% higher than SOTA, indicating its superiority over traditional intelligent agents in utility and generalization. In real-world tasks with quadruped robots, the untrained ITCMA achieves an 85% task completion rate, which is close to its performance in the unseen set, demonstrating its comparable utility in real-world settings.
- [369] arXiv:2403.20127 [ pdf , ps , html , other ]
-
Title: The Impact of Prompts on Zero-Shot Detection of AI-Generated TextSubjects: Artificial Intelligence (cs.AI)
Abstract: In recent years, there have been significant advancements in the development of Large Language Models (LLMs). While their practical applications are now widespread, their potential for misuse, such as generating fake news and committing plagiarism, has posed significant concerns. To address this issue, detectors have been developed to evaluate whether a given text is human-generated or AI-generated. Among others, zero-shot detectors stand out as effective approaches that do not require additional training data and are often likelihood-based. In chat-based applications, users commonly input prompts and utilize the AI-generated texts. However, zero-shot detectors typically analyze these texts in isolation, neglecting the impact of the original prompts. It is conceivable that this approach may lead to a discrepancy in likelihood assessments between the text generation phase and the detection phase. So far, there remains an unverified gap concerning how the presence or absence of prompts impacts detection accuracy for zero-shot detectors. In this paper, we introduce an evaluative framework to empirically analyze the impact of prompts on the detection accuracy of AI-generated text. We assess various zero-shot detectors using both white-box detection, which leverages the prompt, and black-box detection, which operates without prompt information. Our experiments reveal the significant influence of prompts on detection accuracy. Remarkably, compared with black-box detection without prompts, the white-box methods using prompts demonstrate an increase in AUC of at least $0.1$ across all zero-shot detectors tested. Code is available: \url{ this https URL }.
- [370] arXiv:2403.20137 [ pdf , ps , html , other ]
-
Title: Accurate Block Quantization in LLMs with OutliersSubjects: Artificial Intelligence (cs.AI) ; Hardware Architecture (cs.AR); Numerical Analysis (math.NA)
Abstract: The demand for inference on extremely large scale LLMs has seen enormous growth in the recent months. It made evident the colossal shortage of dedicated hardware capable of efficient and fast processing of the involved compute and memory movement. The problem is aggravated by the exploding raise in the lengths of the sequences being processed, since those require efficient on-chip storage of the KV-cache of size proportional to the sequence length. To make the required compute feasible and fit the involved data into available memory, numerous quantization techniques have been proposed that allow accurate quantization for both weights and activations. One of the main recent breakthroughs in this direction was introduction of the family of Block Floating Point (BFP) formats characterized by a block of mantissas with a shared scale factor. These enable memory- power-, and compute- efficient hardware support of the tensor operations and provide extremely good quantization accuracy. The main issues preventing widespread application of block formats is caused by the presence of outliers in weights and activations since those affect the accuracy of the other values in the same block. In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach enabling usage of low precision BFP formats without compromising the resulting model accuracy. We exploit the common channel-wise patterns exhibited by the outliers to rearrange them in such a way, that their quantization quality is significantly improved. The methodology yields 2x savings in the memory footprint without significant degradation of the model's accuracy. Importantly, the rearrangement of channels happens at the compile time and thus has no impact on the inference latency.
- [371] arXiv:2403.20151 [ pdf , ps , other ]
-
Title: A Learning-based Incentive Mechanism for Mobile AIGC Service in Decentralized Internet of VehiclesComments: 2023 IEEE 98th Vehicular Technology Conference (VTC2023-Fall)Subjects: Artificial Intelligence (cs.AI)
Abstract: Artificial Intelligence-Generated Content (AIGC) refers to the paradigm of automated content generation utilizing AI models. Mobile AIGC services in the Internet of Vehicles (IoV) network have numerous advantages over traditional cloud-based AIGC services, including enhanced network efficiency, better reconfigurability, and stronger data security and privacy. Nonetheless, AIGC service provisioning frequently demands significant resources. Consequently, resource-constrained roadside units (RSUs) face challenges in maintaining a heterogeneous pool of AIGC services and addressing all user service requests without degrading overall performance. Therefore, in this paper, we propose a decentralized incentive mechanism for mobile AIGC service allocation, employing multi-agent deep reinforcement learning to find the balance between the supply of AIGC services on RSUs and user demand for services within the IoV context, optimizing user experience and minimizing transmission latency. Experimental results demonstrate that our approach achieves superior performance compared to other baseline models.
- [372] arXiv:2403.20177 [ pdf , ps , other ]
-
Title: Artificial consciousness. Some logical and conceptual preliminariesK. Evers , M. Farisco , R. Chatila , B. D. Earp , I. T. Freire , F. Hamker , E. Nemeth , P. F. M. J. Verschure , M. KhamassiSubjects: Artificial Intelligence (cs.AI) ; Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
Abstract: Is artificial consciousness theoretically possible? Is it plausible? If so, is it technically feasible? To make progress on these questions, it is necessary to lay some groundwork clarifying the logical and empirical conditions for artificial consciousness to arise and the meaning of relevant terms involved. Consciousness is a polysemic word: researchers from different fields, including neuroscience, Artificial Intelligence, robotics, and philosophy, among others, sometimes use different terms in order to refer to the same phenomena or the same terms to refer to different phenomena. In fact, if we want to pursue artificial consciousness, a proper definition of the key concepts is required. Here, after some logical and conceptual preliminaries, we argue for the necessity of using dimensions and profiles of consciousness for a balanced discussion about their possible instantiation or realisation in artificial systems. Our primary goal in this paper is to review the main theoretical questions that arise in the domain of artificial consciousness. On the basis of this review, we propose to assess the issue of artificial consciousness within a multidimensional account. The theoretical possibility of artificial consciousness is already presumed within some theoretical frameworks; however, empirical possibility cannot simply be deduced from these frameworks but needs independent empirical validation. We break down the complexity of consciousness by identifying constituents, components, and dimensions, and reflect pragmatically about the general challenges confronting the creation of artificial consciousness. Despite these challenges, we outline a research strategy for showing how "awareness" as we propose to understand it could plausibly be realised in artificial systems.
- [373] arXiv:2403.20204 [ pdf , ps , html , other ]
-
Title: The Future of Combating Rumors? Retrieval, Discrimination, and GenerationComments: 8 pagesSubjects: Artificial Intelligence (cs.AI)
Abstract: Artificial Intelligence Generated Content (AIGC) technology development has facilitated the creation of rumors with misinformation, impacting societal, economic, and political ecosystems, challenging democracy. Current rumor detection efforts fall short by merely labeling potentially misinformation (classification task), inadequately addressing the issue, and it is unrealistic to have authoritative institutions debunk every piece of information on social media. Our proposed comprehensive debunking process not only detects rumors but also provides explanatory generated content to refute the authenticity of the information. The Expert-Citizen Collective Wisdom (ECCW) module we designed aensures high-precision assessment of the credibility of information and the retrieval module is responsible for retrieving relevant knowledge from a Real-time updated debunking database based on information keywords. By using prompt engineering techniques, we feed results and knowledge into a LLM (Large Language Model), achieving satisfactory discrimination and explanatory effects while eliminating the need for fine-tuning, saving computational costs, and contributing to debunking efforts.
- [374] arXiv:2403.20212 [ pdf , ps , html , other ]
-
Title: On Size and Hardness Generalization in Unsupervised Learning for the Travelling Salesman ProblemSubjects: Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Abstract: We study the generalization capability of Unsupervised Learning in solving the Travelling Salesman Problem (TSP). We use a Graph Neural Network (GNN) trained with a surrogate loss function to generate an embedding for each node. We use these embeddings to construct a heat map that indicates the likelihood of each edge being part of the optimal route. We then apply local search to generate our final predictions. Our investigation explores how different training instance sizes, embedding dimensions, and distributions influence the outcomes of Unsupervised Learning methods. Our results show that training with larger instance sizes and increasing embedding dimensions can build a more effective representation, enhancing the model's ability to solve TSP. Furthermore, in evaluating generalization across different distributions, we first determine the hardness of various distributions and explore how different hardnesses affect the final results. Our findings suggest that models trained on harder instances exhibit better generalization capabilities, highlighting the importance of selecting appropriate training instances in solving TSP using Unsupervised Learning.
- [375] arXiv:2403.20234 [ pdf , ps , html , other ]
-
Title: Artificial Neural Networks-based Real-time Classification of ENG Signals for Implanted Nerve InterfacesSubjects: Artificial Intelligence (cs.AI)
Abstract: Neuropathies are gaining higher relevance in clinical settings, as they risk permanently jeopardizing a person's life. To support the recovery of patients, the use of fully implanted devices is emerging as one of the most promising solutions. However, these devices, even if becoming an integral part of a fully complex neural nanonetwork system, pose numerous challenges. In this article, we address one of them, which consists of the classification of motor/sensory stimuli. The task is performed by exploring four different types of artificial neural networks (ANNs) to extract various sensory stimuli from the electroneurographic (ENG) signal measured in the sciatic nerve of rats. Different sizes of the data sets are considered to analyze the feasibility of the investigated ANNs for real-time classification through a comparison of their performance in terms of accuracy, F1-score, and prediction time. The design of the ANNs takes advantage of the modelling of the ENG signal as a multiple-input multiple-output (MIMO) system to describe the measures taken by state-of-the-art implanted nerve interfaces. These are based on the use of multi-contact cuff electrodes to achieve nanoscale spatial discrimination of the nerve activity. The MIMO ENG signal model is another contribution of this paper. Our results show that some ANNs are more suitable for real-time applications, being capable of achieving accuracies over $90\%$ for signal windows of $100$ and $200\,$ms with a low enough processing time to be effective for pathology recovery.
- [376] arXiv:2403.20306 [ pdf , ps , html , other ]
-
Title: Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM InferenceComments: 6 pages, 15 figuresSubjects: Artificial Intelligence (cs.AI) ; Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models. In this paper, we present the trade-offs brought up by making energy efficiency the primary goal of LLM serving under performance SLOs. We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient. We characterize the impact of these knobs on the latency, throughput, as well as the energy. By exploring these trade-offs, we offer valuable insights into optimizing energy usage without compromising on performance, thereby paving the way for sustainable and cost-effective LLM deployment in data center environments.
- [377] arXiv:2403.00011 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Introducing User Feedback-based Counterfactual Explanations (UFCE)Comments: preprint of paper submitted to IJCIS SpringerSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Machine learning models are widely used in real-world applications. However, their complexity makes it often challenging to interpret the rationale behind their decisions. Counterfactual explanations (CEs) have emerged as a viable solution for generating comprehensible explanations in eXplainable Artificial Intelligence (XAI). CE provides actionable information to users on how to achieve the desired outcome with minimal modifications to the input. However, current CE algorithms usually operate within the entire feature space when optimizing changes to turn over an undesired outcome, overlooking the identification of key contributors to the outcome and disregarding the practicality of the suggested changes. In this study, we introduce a novel methodology, that is named as user feedback-based counterfactual explanation (UFCE), which addresses these limitations and aims to bolster confidence in the provided explanations. UFCE allows for the inclusion of user constraints to determine the smallest modifications in the subset of actionable features while considering feature dependence, and evaluates the practicality of suggested changes using benchmark evaluation metrics. We conducted three experiments with five datasets, demonstrating that UFCE outperforms two well-known CE methods in terms of \textit{proximity}, \textit{sparsity}, and \textit{feasibility}. Reported results indicate that user constraints influence the generation of feasible CEs.
- [378] arXiv:2403.00014 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: GIN-SD: Source Detection in Graphs with Incomplete Nodes via Positional Encoding and Attentive FusionComments: The paper is accepted by AAAI24Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Source detection in graphs has demonstrated robust efficacy in the domain of rumor source identification. Although recent solutions have enhanced performance by leveraging deep neural networks, they often require complete user data. In this paper, we address a more challenging task, rumor source detection with incomplete user data, and propose a novel framework, i.e., Source Detection in Graphs with Incomplete Nodes via Positional Encoding and Attentive Fusion (GIN-SD), to tackle this challenge. Specifically, our approach utilizes a positional embedding module to distinguish nodes that are incomplete and employs a self-attention mechanism to focus on nodes with greater information transmission capacity. To mitigate the prediction bias caused by the significant disparity between the numbers of source and non-source nodes, we also introduce a class-balancing mechanism. Extensive experiments validate the effectiveness of GIN-SD and its superiority to state-of-the-art methods.
- [379] arXiv:2403.00016 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Sensitivity Analysis for Objective-Oriented Combinatorial OptimizationGanga Gireesan , Nisha Pillai , Michael J Rothrock , Bindu Nanduri , Zhiqian Chen , Mahalingam RamkumarComments: The 2023 International Conference on Computational Science & Computational Intelligence (CSCI'23)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Pathogen control is a critical aspect of modern poultry farming, providing important benefits for both public health and productivity. Effective poultry management measures to reduce pathogen levels in poultry flocks promote food safety by lowering risks of food-borne illnesses. They also support animal health and welfare by preventing infectious diseases that can rapidly spread and impact flock growth, egg production, and overall health. This study frames the search for optimal management practices that minimize the presence of multiple pathogens as a combinatorial optimization problem. Specifically, we model the various possible combinations of management settings as a solution space that can be efficiently explored to identify configurations that optimally reduce pathogen levels. This design incorporates a neural network feedback-based method that combines feature explanations with global sensitivity analysis to ensure combinatorial optimization in multiobjective settings. Our preliminary experiments have promising results when applied to two real-world agricultural datasets. While further validation is still needed, these early experimental findings demonstrate the potential of the model to derive targeted feature interactions that adaptively optimize pathogen control under varying real-world constraints.
- [380] arXiv:2403.00017 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Interpreting Multi-Objective Feature AssociationsNisha Pillai , Ganga Gireesan , Michael J. Rothrock Jr. , Bindu Nanduri , Zhiqian Chen , Mahalingam RamkumarComments: The 18th Annual IEEE International Systems Conference 2024 (IEEE SYSCON 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Understanding how multiple features are associated and contribute to a specific objective is as important as understanding how each feature contributes to a particular outcome. Interpretability of a single feature in a prediction may be handled in multiple ways; however, in a multi-objective prediction, it is difficult to obtain interpretability of a combination of feature values. To address this issue, we propose an objective specific feature interaction design using multi-labels to find the optimal combination of features in agricultural settings. One of the novel aspects of this design is the identification of a method that integrates feature explanations with global sensitivity analysis in order to ensure combinatorial optimization in multi-objective settings. We have demonstrated in our preliminary experiments that an approximate combination of feature values can be found to achieve the desired outcome using two agricultural datasets: one with pre-harvest poultry farm practices for multi-drug resistance presence, and one with post-harvest poultry farm practices for food-borne pathogens. In our combinatorial optimization approach, all three pathogens are taken into consideration simultaneously to account for the interaction between conditions that favor different types of pathogen growth. These results indicate that explanation-based approaches are capable of identifying combinations of features that reduce pathogen presence in fewer iterations than a baseline.
- [381] arXiv:2403.00023 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Auditable Homomorphic-based Decentralized Collaborative AI with Attribute-based Differential PrivacyComments: 12 pages, 9 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In recent years, the notion of federated learning (FL) has led to the new paradigm of distributed artificial intelligence (AI) with privacy preservation. However, most current FL systems suffer from data privacy issues due to the requirement of a trusted third party. Although some previous works introduce differential privacy to protect the data, however, it may also significantly deteriorate the model performance. To address these issues, we propose a novel decentralized collaborative AI framework, named Auditable Homomorphic-based Decentralised Collaborative AI (AerisAI), to improve security with homomorphic encryption and fine-grained differential privacy. Our proposed AerisAI directly aggregates the encrypted parameters with a blockchain-based smart contract to get rid of the need of a trusted third party. We also propose a brand-new concept for eliminating the negative impacts of differential privacy for model performance. Moreover, the proposed AerisAI also provides the broadcast-aware group key management based on ciphertext-policy attribute-based encryption (CPABE) to achieve fine-grained access control based on different service-level agreements. We provide a formal theoretical analysis of the proposed AerisAI as well as the functionality comparison with the other baselines. We also conduct extensive experiments on real datasets to evaluate the proposed approach. The experimental results indicate that our proposed AerisAI significantly outperforms the other state-of-the-art baselines.
- [382] arXiv:2403.00025 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Challenges and Opportunities in Generative AILaura Manduchi , Kushagra Pandey , Robert Bamler , Ryan Cotterell , Sina Däubener , Sophie Fellenz , Asja Fischer , Thomas Gärtner , Matthias Kirchler , Marius Kloft , Yingzhen Li , Christoph Lippert , Gerard de Melo , Eric Nalisnick , Björn Ommer , Rajesh Ranganath , Maja Rudolph , Karen Ullrich , Guy Van den Broeck , Julia E Vogt , Yixin Wang , Florian Wenzel , Frank Wood , Stephan Mandt , Vincent FortuinSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The field of deep generative modeling has grown rapidly and consistently over the years. With the availability of massive amounts of training data coupled with advances in scalable unsupervised learning paradigms, recent large-scale generative models show tremendous promise in synthesizing high-resolution images and text, as well as structured data such as videos and molecules. However, we argue that current large-scale generative AI models do not sufficiently address several fundamental issues that hinder their widespread adoption across domains. In this work, we aim to identify key unresolved challenges in modern generative AI paradigms that should be tackled to further enhance their capabilities, versatility, and reliability. By identifying these challenges, we aim to provide researchers with valuable insights for exploring fruitful research directions, thereby fostering the development of more robust and accessible generative AI solutions.
- [383] arXiv:2403.00026 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning to Deliver: a Foundation Model for the Montreal Capacitated Vehicle Routing ProblemSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: In this paper, we present the Foundation Model for the Montreal Capacitated Vehicle Routing Problem (FM-MCVRP), a novel Deep Learning (DL) model that approximates high-quality solutions to a variant of the Capacitated Vehicle Routing Problem (CVRP) that characterizes many real-world applications. The so-called Montreal Capacitated Vehicle Routing Problem (MCVRP), first formally described by Bengio et al. (2021), is defined on a fixed and finite graph, which is analogous to a city. Each MCVRP instance is essentially the sub-graph connecting a randomly sampled subset of the nodes in the fixed graph, which represent a set of potential addresses in a real-world delivery problem on a given day. Our work exploits this problem structure to frame the MCVRP as an analogous Natural Language Processing (NLP) task. Specifically, we leverage a Transformer architecture embedded in a Large Language Model (LLM) framework to train our model in a supervised manner on computationally inexpensive, sub-optimal MCVRP solutions obtained algorithmically. Through comprehensive computational experiments, we show that FM-MCVRP produces better MCVRP solutions than the training data and generalizes to larger sized problem instances not seen during training. Even when compared to near-optimal solutions from state-of-the-art heuristics, FM-MCVRP yields competitive results despite being trained on inferior data. For instance, for 400-customer problems, FM-MCVRP solutions on average fall within 2% of the benchmark. Our results further demonstrate that unlike prior works in the literature, FM-MCVRP is a unified model, which performs consistently and reliably on a range of problem instance sizes and parameter values such as the vehicle capacity.
- [384] arXiv:2403.00030 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: GraphPub: Generation of Differential Privacy Graph with High AvailabilitySubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: In recent years, with the rapid development of graph neural networks (GNN), more and more graph datasets have been published for GNN tasks. However, when an upstream data owner publishes graph data, there are often many privacy concerns, because many real-world graph data contain sensitive information like person's friend list. Differential privacy (DP) is a common method to protect privacy, but due to the complex topological structure of graph data, applying DP on graphs often affects the message passing and aggregation of GNN models, leading to a decrease in model accuracy. In this paper, we propose a novel graph edge protection framework, graph publisher (GraphPub), which can protect graph topology while ensuring that the availability of data is basically unchanged. Through reverse learning and the encoder-decoder mechanism, we search for some false edges that do not have a large negative impact on the aggregation of node features, and use them to replace some real edges. The modified graph will be published, which is difficult to distinguish between real and false data. Sufficient experiments prove that our framework achieves model accuracy close to the original graph with an extremely low privacy budget.
- [385] arXiv:2403.00032 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Time to Cite: Modeling Citation Networks using the Dynamic Impact Single-Event Embedding ModelComments: Accepted for AISTATS 2024Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Abstract: Understanding the structure and dynamics of scientific research, i.e., the science of science (SciSci), has become an important area of research in order to address imminent questions including how scholars interact to advance science, how disciplines are related and evolve, and how research impact can be quantified and predicted. Central to the study of SciSci has been the analysis of citation networks. Here, two prominent modeling methodologies have been employed: one is to assess the citation impact dynamics of papers using parametric distributions, and the other is to embed the citation networks in a latent space optimal for characterizing the static relations between papers in terms of their citations. Interestingly, citation networks are a prominent example of single-event dynamic networks, i.e., networks for which each dyad only has a single event (i.e., the point in time of citation). We presently propose a novel likelihood function for the characterization of such single-event networks. Using this likelihood, we propose the Dynamic Impact Single-Event Embedding model (DISEE). The \textsc{\modelabbrev} model characterizes the scientific interactions in terms of a latent distance model in which random effects account for citation heterogeneity while the time-varying impact is characterized using existing parametric representations for assessment of dynamic impact. We highlight the proposed approach on several real citation networks finding that the DISEE well reconciles static latent distance network embedding approaches with classical dynamic impact assessments.
- [386] arXiv:2403.00036 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Influencing Bandits: Arm Selection for Preference ShapingComments: 14 pages, 8 figures, 24 references, proofs in appendixSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Systems and Control (eess.SY)
Abstract: We consider a non stationary multi-armed bandit in which the population preferences are positively and negatively reinforced by the observed rewards. The objective of the algorithm is to shape the population preferences to maximize the fraction of the population favouring a predetermined arm. For the case of binary opinions, two types of opinion dynamics are considered -- decreasing elasticity (modeled as a Polya urn with increasing number of balls) and constant elasticity (using the voter model). For the first case, we describe an Explore-then-commit policy and a Thompson sampling policy and analyse the regret for each of these policies. We then show that these algorithms and their analyses carry over to the constant elasticity case. We also describe a Thompson sampling based algorithm for the case when more than two types of opinions are present. Finally, we discuss the case where presence of multiple recommendation systems gives rise to a trade-off between their popularity and opinion shaping objectives.
- [387] arXiv:2403.00037 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Evolving to the Future: Unseen Event Adaptive Fake News Detection on Social MediaSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the rapid development of social media, the wide dissemination of fake news on social media is increasingly threatening both individuals and society. In the dynamic landscape of social media, fake news detection aims to develop a model trained on news reporting past events. The objective is to predict and identify fake news about future events, which often relate to subjects entirely different from those in the past. However, existing fake detection methods exhibit a lack of robustness and cannot generalize to unseen events. To address this, we introduce Future ADaptive Event-based Fake news Detection (FADE) framework. Specifically, we train a target predictor through an adaptive augmentation strategy and graph contrastive learning to make more robust overall predictions. Simultaneously, we independently train an event-only predictor to obtain biased predictions. Then we further mitigate event bias by obtaining the final prediction by subtracting the output of the event-only predictor from the output of the target predictor. Encouraging results from experiments designed to emulate real-world social media conditions validate the effectiveness of our method in comparison to existing state-of-the-art approaches.
- [388] arXiv:2403.00039 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: FhGenie: A Custom, Confidentiality-preserving Chat AI for Corporate and Scientific UseIngo Weber , Hendrik Linka , Daniel Mertens , Tamara Muryshkin , Heinrich Opgenoorth , Stefan LangerSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Since OpenAI's release of ChatGPT, generative AI has received significant attention across various domains. These AI-based chat systems have the potential to enhance the productivity of knowledge workers in diverse tasks. However, the use of free public services poses a risk of data leakage, as service providers may exploit user input for additional training and optimization without clear boundaries. Even subscription-based alternatives sometimes lack transparency in handling user data. To address these concerns and enable Fraunhofer staff to leverage this technology while ensuring confidentiality, we have designed and developed a customized chat AI called FhGenie (genie being a reference to a helpful spirit). Within few days of its release, thousands of Fraunhofer employees started using this service. As pioneers in implementing such a system, many other organizations have followed suit. Our solution builds upon commercial large language models (LLMs), which we have carefully integrated into our system to meet our specific requirements and compliance constraints, including confidentiality and GDPR. In this paper, we share detailed insights into the architectural considerations, design, implementation, and subsequent updates of FhGenie. Additionally, we discuss challenges, observations, and the core lessons learned from its productive usage.
- [389] arXiv:2403.00041 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Global and Local Prompts Cooperation via Optimal Transport for Federated LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Prompt learning in pretrained visual-language models has shown remarkable flexibility across various downstream tasks. Leveraging its inherent lightweight nature, recent research attempted to integrate the powerful pretrained models into federated learning frameworks to simultaneously reduce communication costs and promote local training on insufficient data. Despite these efforts, current federated prompt learning methods lack specialized designs to systematically address severe data heterogeneities, e.g., data distribution with both label and feature shifts involved. To address this challenge, we present Federated Prompts Cooperation via Optimal Transport (FedOTP), which introduces efficient collaborative prompt learning strategies to capture diverse category traits on a per-client basis. Specifically, for each client, we learn a global prompt to extract consensus knowledge among clients, and a local prompt to capture client-specific category characteristics. Unbalanced Optimal Transport is then employed to align local visual features with these prompts, striking a balance between global consensus and local personalization. By relaxing one of the equality constraints, FedOTP enables prompts to focus solely on the core regions of image patches. Extensive experiments on datasets with various types of heterogeneities have demonstrated that our FedOTP outperforms the state-of-the-art methods.
- [390] arXiv:2403.00044 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Scaling up Dynamic Edge Partition Models via Stochastic Gradient MCMCSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The edge partition model (EPM) is a generative model for extracting an overlapping community structure from static graph-structured data. In the EPM, the gamma process (GaP) prior is adopted to infer the appropriate number of latent communities, and each vertex is endowed with a gamma distributed positive memberships vector. Despite having many attractive properties, inference in the EPM is typically performed using Markov chain Monte Carlo (MCMC) methods that prevent it from being applied to massive network data. In this paper, we generalize the EPM to account for dynamic enviroment by representing each vertex with a positive memberships vector constructed using Dirichlet prior specification, and capturing the time-evolving behaviour of vertices via a Dirichlet Markov chain construction. A simple-to-implement Gibbs sampler is proposed to perform posterior computation using Negative- Binomial augmentation technique. For large network data, we propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in the proposed model. The experimental results show that the novel methods achieve competitive performance in terms of link prediction, while being much faster.
- [391] arXiv:2403.00046 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code GenerationSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training samples available in practice lead to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training samples is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome its own shortcomings, thus achieving efficient learning. Specifically, SEED involves identifying error code generated by LLMs, employing Self-revise for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-revise, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.
- [392] arXiv:2403.00071 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Resonance RoPE: Improving Context Length Generalization of Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD position better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
- [393] arXiv:2403.00108 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play ScenarioHongyi Liu , Zirui Liu , Ruixiang Tang , Jiayi Yuan , Shaochen Zhong , Yu-Neng Chuang , Li Li , Rui Chen , Xia HuSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Fine-tuning LLMs is crucial to enhancing their task-specific performance and ensuring model behaviors are aligned with human preferences. Among various fine-tuning methods, LoRA is popular for its efficiency and ease to use, allowing end-users to easily post and adopt lightweight LoRA modules on open-source platforms to tailor their model for different customization. However, such a handy share-and-play setting opens up new attack surfaces, that the attacker can render LoRA as an attacker, such as backdoor injection, and widely distribute the adversarial LoRA to the community easily. This can result in detrimental outcomes. Despite the huge potential risks of sharing LoRA modules, this aspect however has not been fully explored. To fill the gap, in this study we thoroughly investigate the attack opportunities enabled in the growing share-and-play scenario. Specifically, we study how to inject backdoor into the LoRA module and dive deeper into LoRA's infection mechanisms. We found that training-free mechanism is possible in LoRA backdoor injection. We also discover the impact of backdoor attacks with the presence of multiple LoRA adaptions concurrently as well as LoRA based backdoor transferability. Our aim is to raise awareness of the potential risks under the emerging share-and-play scenario, so as to proactively prevent potential consequences caused by LoRA-as-an-Attack. Warning: the paper contains potential offensive content generated by models.
- [394] arXiv:2403.00116 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Federated Linear Contextual Bandits with Heterogeneous ClientsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The demand for collaborative and private bandit learning across multiple agents is surging due to the growing quantity of data generated from distributed systems. Federated bandit learning has emerged as a promising framework for private, efficient, and decentralized online learning. However, almost all previous works rely on strong assumptions of client homogeneity, i.e., all participating clients shall share the same bandit model; otherwise, they all would suffer linear regret. This greatly restricts the application of federated bandit learning in practice. In this work, we introduce a new approach for federated bandits for heterogeneous clients, which clusters clients for collaborative bandit learning under the federated learning setting. Our proposed algorithm achieves non-trivial sub-linear regret and communication cost for all clients, subject to the communication protocol under federated learning that at anytime only one model can be shared by the server.
- [395] arXiv:2403.00131 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: UniTS: Building a Unified Time Series ModelShanghua Gao , Teddy Koker , Owen Queen , Thomas Hartvigsen , Theodoros Tsiligkaridis , Marinka ZitnikSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Foundation models, especially LLMs, are profoundly transforming deep learning. Instead of training many task-specific models, we can adapt a single pretrained model to many tasks via fewshot prompting or fine-tuning. However, current foundation models apply to sequence data but not to time series, which present unique challenges due to the inherent diverse and multidomain time series datasets, diverging task specifications across forecasting, classification and other types of tasks, and the apparent need for task-specialized models. We developed UNITS, a unified time series model that supports a universal task specification, accommodating classification, forecasting, imputation, and anomaly detection tasks. This is achieved through a novel unified network backbone, which incorporates sequence and variable attention along with a dynamic linear operator and is trained as a unified model. Across 38 multi-domain datasets, UNITS demonstrates superior performance compared to task-specific models and repurposed natural language-based LLMs. UNITS exhibits remarkable zero-shot, few-shot, and prompt learning capabilities when evaluated on new data domains and tasks. The source code and datasets are available at this https URL .
- [396] arXiv:2403.00141 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EROS: Entity-Driven Controlled Policy Document SummarizationComments: Accepted in LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Privacy policy documents have a crucial role in educating individuals about the collection, usage, and protection of users' personal data by organizations. However, they are notorious for their lengthy, complex, and convoluted language especially involving privacy-related entities. Hence, they pose a significant challenge to users who attempt to comprehend organization's data usage policy. In this paper, we propose to enhance the interpretability and readability of policy documents by using controlled abstractive summarization -- we enforce the generated summaries to include critical privacy-related entities (e.g., data and medium) and organization's rationale (e.g.,target and reason) in collecting those entities. To achieve this, we develop PD-Sum, a policy-document summarization dataset with marked privacy-related entity labels. Our proposed model, EROS, identifies critical entities through a span-based entity extraction model and employs them to control the information content of the summaries using proximal policy optimization (PPO). Comparison shows encouraging improvement over various baselines. Furthermore, we furnish qualitative and human evaluations to establish the efficacy of EROS.
- [397] arXiv:2403.00143 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree AveragingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We address unsupervised discontinuous constituency parsing, where we observe a high variance in the performance of the only previous model. We propose to build an ensemble of different runs of the existing discontinuous parser by averaging the predicted trees, to stabilize and boost performance. To begin with, we provide comprehensive computational complexity analysis (in terms of P and NP-complete) for tree averaging under different setups of binarity and continuity. We then develop an efficient exact algorithm to tackle the task, which runs in a reasonable time for all samples in our experiments. Results on three datasets show our method outperforms all baselines in all metrics; we also provide in-depth analyses of our approach.
- [398] arXiv:2403.00144 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine TranslationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but they are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; profoundly, our EBBS-based distillation does not sacrifice, or even improves, the translation quality.
- [399] arXiv:2403.00154 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LLMs in Political Science: Heralding a New Era of Visual AnalysisComments: 7 pages, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Interest is increasing among political scientists in leveraging the extensive information available in images. However, the challenge of interpreting these images lies in the need for specialized knowledge in computer vision and access to specialized hardware. As a result, image analysis has been limited to a relatively small group within the political science community. This landscape could potentially change thanks to the rise of large language models (LLMs). This paper aims to raise awareness of the feasibility of using Gemini for image content analysis. A retrospective analysis was conducted on a corpus of 688 images. Content reports were elicited from Gemini for each image and then manually evaluated by the authors. We find that Gemini is highly accurate in performing object detection, which is arguably the most common and fundamental task in image analysis for political scientists. Equally important, we show that it is easy to implement as the entire command consists of a single prompt in natural language; it is fast to run and should meet the time budget of most researchers; and it is free to use and does not require any specialized hardware. In addition, we illustrate how political scientists can leverage Gemini for other image understanding tasks, including face identification, sentiment analysis, and caption generation. Our findings suggest that Gemini and other similar LLMs have the potential to drastically stimulate and accelerate image research in political science and social sciences more broadly.
- [400] arXiv:2403.00172 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Go Beyond Black-box Policies: Rethinking the Design of Learning Agent for Interpretable and Verifiable HVAC ControlComments: Accepted for the 61st Design Automation Conference (DAC)Subjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent research has shown the potential of Model-based Reinforcement Learning (MBRL) to enhance energy efficiency of Heating, Ventilation, and Air Conditioning (HVAC) systems. However, existing methods rely on black-box thermal dynamics models and stochastic optimizers, lacking reliability guarantees and posing risks to occupant health. In this work, we overcome the reliability bottleneck by redesigning HVAC controllers using decision trees extracted from existing thermal dynamics models and historical data. Our decision tree-based policies are deterministic, verifiable, interpretable, and more energy-efficient than current MBRL methods. First, we introduce a novel verification criterion for RL agents in HVAC control based on domain knowledge. Second, we develop a policy extraction procedure that produces a verifiable decision tree policy. We found that the high dimensionality of the thermal dynamics model input hinders the efficiency of policy extraction. To tackle the dimensionality challenge, we leverage importance sampling conditioned on historical data distributions, significantly improving policy extraction efficiency. Lastly, we present an offline verification algorithm that guarantees the reliability of a control policy. Extensive experiments show that our method saves 68.4% more energy and increases human comfort gain by 14.8% compared to the state-of-the-art method, in addition to an 1127x reduction in computation overhead. Our code and data are available at this https URL
- [401] arXiv:2403.00175 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anythingComments: 14 pages, 9 figures, 1 tableJournal-ref: Sensors 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of computer vision, the integration of advanced techniques into the processing of RGB-D camera inputs poses a significant challenge, given the inherent complexities arising from diverse environmental conditions and varying object appearances. Therefore, this paper introduces FusionVision, an exhaustive pipeline adapted for the robust 3D segmentation of objects in RGB-D imagery. Traditional computer vision systems face limitations in simultaneously capturing precise object boundaries and achieving high-precision object detection on depth map as they are mainly proposed for RGB cameras. To address this challenge, FusionVision adopts an integrated approach by merging state-of-the-art object detection techniques, with advanced instance segmentation methods. The integration of these components enables a holistic (unified analysis of information obtained from both color \textit{RGB} and depth \textit{D} channels) interpretation of RGB-D data, facilitating the extraction of comprehensive and accurate object information. The proposed FusionVision pipeline employs YOLO for identifying objects within the RGB image domain. Subsequently, FastSAM, an innovative semantic segmentation model, is applied to delineate object boundaries, yielding refined segmentation masks. The synergy between these components and their integration into 3D scene understanding ensures a cohesive fusion of object detection and segmentation, enhancing overall precision in 3D object segmentation. The code and pre-trained models are publicly available at this https URL .
- [402] arXiv:2403.00176 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SoD$^2$: Statically Optimizing Dynamic Deep Neural NetworkSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Abstract: Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution, are becoming common. This paper presents SoD$^2$, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains. Our evaluation results show that SoD$^2$ runs up to $3.9\times$ faster than these systems while saving up to $88\%$ peak memory consumption.
- [403] arXiv:2403.00178 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Causal Graph ODE: Continuous Treatment Effect Modeling in Multi-agent Dynamical SystemsZijie Huang , Jeehyun Hwang , Junkai Zhang , Jinwoo Baik , Weitong Zhang , Dominik Wodarz , Yizhou Sun , Quanquan Gu , Wei WangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Real-world multi-agent systems are often dynamic and continuous, where the agents co-evolve and undergo changes in their trajectories and interactions over time. For example, the COVID-19 transmission in the U.S. can be viewed as a multi-agent system, where states act as agents and daily population movements between them are interactions. Estimating the counterfactual outcomes in such systems enables accurate future predictions and effective decision-making, such as formulating COVID-19 policies. However, existing methods fail to model the continuous dynamic effects of treatments on the outcome, especially when multiple treatments (e.g., "stay-at-home" and "get-vaccine" policies) are applied simultaneously. To tackle this challenge, we propose Causal Graph Ordinary Differential Equations (CAG-ODE), a novel model that captures the continuous interaction among agents using a Graph Neural Network (GNN) as the ODE function. The key innovation of our model is to learn time-dependent representations of treatments and incorporate them into the ODE function, enabling precise predictions of potential outcomes. To mitigate confounding bias, we further propose two domain adversarial learning-based objectives, which enable our model to learn balanced continuous representations that are not affected by treatments or interference. Experiments on two datasets (i.e., COVID-19 and tumor growth) demonstrate the superior performance of our proposed model.
- [404] arXiv:2403.00190 (cross-list from cs.SI) [ pdf , ps , other ]
-
Title: Identification of important nodes in the information propagation network based on the artificial intelligence methodSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: This study presents an integrated approach for identifying key nodes in information propagation networks using advanced artificial intelligence methods. We introduce a novel technique that combines the Decision-making Trial and Evaluation Laboratory (DEMATEL) method with the Global Structure Model (GSM), creating a synergistic model that effectively captures both local and global influences within a network. This method is applied across various complex networks, such as social, transportation, and communication systems, utilizing the Global Network Influence Dataset (GNID). Our analysis highlights the structural dynamics and resilience of these networks, revealing insights into node connectivity and community formation. The findings demonstrate the effectiveness of our AI-based approach in offering a comprehensive understanding of network behavior, contributing significantly to strategic network analysis and optimization.
- [405] arXiv:2403.00196 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning to Find Missing Video Frames with Synthetic Data Augmentation: A General Framework and Application in Generating Thermal Images Using RGB CamerasSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Advanced Driver Assistance Systems (ADAS) in intelligent vehicles rely on accurate driver perception within the vehicle cabin, often leveraging a combination of sensing modalities. However, these modalities operate at varying rates, posing challenges for real-time, comprehensive driver state monitoring. This paper addresses the issue of missing data due to sensor frame rate mismatches, introducing a generative model approach to create synthetic yet realistic thermal imagery. We propose using conditional generative adversarial networks (cGANs), specifically comparing the pix2pix and CycleGAN architectures. Experimental results demonstrate that pix2pix outperforms CycleGAN, and utilizing multi-view input styles, especially stacked views, enhances the accuracy of thermal image generation. Moreover, the study evaluates the model's generalizability across different subjects, revealing the importance of individualized training for optimal performance. The findings suggest the potential of generative models in addressing missing frames, advancing driver state monitoring for intelligent vehicles, and underscoring the need for continued research in model generalization and customization.
- [406] arXiv:2403.00198 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language Model OutputsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Pre-trained Large Language Models (LLMs) have significantly advanced natural language processing capabilities but are susceptible to biases present in their training data, leading to unfair outcomes in various applications. While numerous strategies have been proposed to mitigate bias, they often require extensive computational resources and may compromise model performance. In this work, we introduce AXOLOTL, a novel post-processing framework, which operates agnostically across tasks and models, leveraging public APIs to interact with LLMs without direct access to internal parameters. Through a three-step process resembling zero-shot learning, AXOLOTL identifies biases, proposes resolutions, and guides the model to self-debias its outputs. This approach minimizes computational costs and preserves model performance, making AXOLOTL a promising tool for debiasing LLM outputs with broad applicability and ease of use.
- [407] arXiv:2403.00225 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Robust Policy Learning via Offline Skill DiffusionComments: Accepted for AAAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets, which poses a challenge when attempting to learn a skill-based policy via RL for a target domain different from the datasets' domains. In this paper, we present a novel offline skill learning framework DuSkill which employs a guided Diffusion model to generate versatile skills extended from the limited skills in datasets, thereby enhancing the robustness of policy learning for tasks in different domains. Specifically, we devise a guided diffusion-based skill decoder in conjunction with the hierarchical encoding to disentangle the skill embedding space into two distinct representations, one for encapsulating domain-invariant behaviors and the other for delineating the factors that induce domain variations in the behaviors. Our DuSkill framework enhances the diversity of skills learned offline, thus enabling to accelerate the learning procedure of high-level policies for different domains. Through experiments, we show that DuSkill outperforms other skill-based imitation learning and RL algorithms for several long-horizon tasks, demonstrating its benefits in few-shot imitation and online RL.
- [408] arXiv:2403.00236 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performanceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We investigate the performance of LLM-based zero-shot stance detection on tweets. Using FlanT5-XXL, an instruction-tuned open-source LLM, with the SemEval 2016 Tasks 6A, 6B, and P-Stance datasets, we study the performance and its variations under different prompts and decoding strategies, as well as the potential biases of the model. We show that the zero-shot approach can match or outperform state-of-the-art benchmarks, including fine-tuned models. We provide various insights into its performance including the sensitivity to instructions and prompts, the decoding strategies, the perplexity of the prompts, and to negations and oppositions present in prompts. Finally, we ensure that the LLM has not been trained on test datasets, and identify a positivity bias which may partially explain the performance differences across decoding strategie
- [409] arXiv:2403.00250 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the long-tailed recognition field, the Decoupled Training paradigm has demonstrated remarkable capabilities among various methods. This paradigm decouples the training process into separate representation learning and classifier re-training. Previous works have attempted to improve both stages simultaneously, making it difficult to isolate the effect of classifier re-training. Furthermore, recent empirical studies have demonstrated that simple regularization can yield strong feature representations, emphasizing the need to reassess existing classifier re-training methods. In this study, we revisit classifier re-training methods based on a unified feature representation and re-evaluate their performances. We propose a new metric called Logits Magnitude as a superior measure of model performance, replacing the commonly used Weight Norm. However, since it is hard to directly optimize the new metric during training, we introduce a suitable approximate invariant called Regularized Standard Deviation. Based on the two newly proposed metrics, we prove that reducing the absolute value of Logits Magnitude when it is nearly balanced can effectively decrease errors and disturbances during training, leading to better model performance. Motivated by these findings, we develop a simple logits retargeting approach (LORT) without the requirement of prior knowledge of the number of samples per class. LORT divides the original one-hot label into small true label probabilities and large negative label probabilities distributed across each class. Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist2018.
- [410] arXiv:2403.00252 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EUROPA: A Legal Multilingual Keyphrase Generation DatasetOlivier Salaün , Frédéric Piedboeuf , Guillaume Le Berre , David Alfonso Hermelo , Philippe LanglaisComments: 8 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Keyphrase generation has primarily been explored within the context of academic research articles, with a particular focus on scientific domains and the English language. In this work, we present EUROPA, a dataset for multilingual keyphrase generation in the legal domain. It is derived from legal judgments from the Court of Justice of the European Union (EU), and contains instances in all 24 EU official languages. We run multilingual models on our corpus and analyze the results, showing room for improvement on a domain-specific multilingual corpus such as the one we present.
- [411] arXiv:2403.00254 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cloud-based Federated Learning Framework for MRI SegmentationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In contemporary rural healthcare settings, the principal challenge in diagnosing brain images is the scarcity of available data, given that most of the existing deep learning models demand extensive training data to optimize their performance, necessitating centralized processing methods that potentially compromise data privacy. This paper proposes a novel framework tailored for brain tissue segmentation in rural healthcare facilities. The framework employs a deep reinforcement learning (DRL) environment in tandem with a refinement model (RM) deployed locally at rural healthcare sites. The proposed DRL model has a reduced parameter count and practicality for implementation across distributed rural sites. To uphold data privacy and enhance model generalization without transgressing privacy constraints, we employ federated learning (FL) for cooperative model training. We demonstrate the efficacy of our approach by training the network with a limited data set and observing a substantial performance enhancement, mitigating inaccuracies and irregularities in segmentation across diverse sites. Remarkably, the DRL model attains an accuracy of up to 80%, surpassing the capabilities of conventional convolutional neural networks when confronted with data insufficiency. Incorporating our RM results in an additional accuracy improvement of at least 10%, while FL contributes to a further accuracy enhancement of up to 5%. Collectively, the framework achieves an average 92% accuracy rate within rural healthcare settings characterized by data constraints.
- [412] arXiv:2403.00290 (cross-list from cs.IT) [ pdf , ps , other ]
-
Title: Semantic Text Transmission via Prediction with Small Language Models: Cost-Similarity Trade-offSubjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We consider the communication of natural language text from a source to a destination over noiseless and character-erasure channels. We exploit language's inherent correlations and predictability to constrain transmission costs by allowing the destination to predict or complete words with potential dissimilarity with the source text. Concretely, our objective is to obtain achievable $(\bar{c}, \bar{s})$ pairs, where $\bar{c}$ is the average transmission cost at the source and $\bar{s}$ is the average semantic similarity measured via cosine similarity between vector embedding of words at the source and those predicted/completed at the destination. We obtain $(\bar{c}, \bar{s})$ pairs for neural language and first-order Markov chain-based small language models (SLM) for prediction, using both a threshold policy that transmits a word if its cosine similarity with that predicted/completed at the destination is below a threshold, and a periodic policy, which transmits words after a specific interval and predicts/completes the words in between, at the destination. We adopt an SLM for word completion. We demonstrate that, when communication occurs over a noiseless channel, the threshold policy achieves a higher $\bar{s}$ for a given $\bar{c}$ than the periodic policy and that the $\bar{s}$ achieved with the neural SLM is greater than or equal to that of the Markov chain-based algorithm for the same $\bar{c}$. The improved performance comes with a higher complexity in terms of time and computing requirements. However, when communication occurs over a character-erasure channel, all prediction algorithms and scheduling policies perform poorly. Furthermore, if character-level Huffman coding is used, the required $\bar{c}$ to achieve a given $\bar{s}$ is reduced, but the above observations still apply.
- [413] arXiv:2403.00299 (cross-list from cs.IT) [ pdf , ps , html , other ]
-
Title: Universal Auto-encoder Framework for MIMO CSI FeedbackComments: 7 pages, 11 figuresSubjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Existing auto-encoder (AE)-based channel state information (CSI) frameworks have focused on a specific configuration of user equipment (UE) and base station (BS), and thus the input and output sizes of the AE are fixed. However, in the real-world scenario, the input and output sizes may vary depending on the number of antennas of the BS and UE and the allocated resource block in the frequency dimension. A naive approach to support the different input and output sizes is to use multiple AE models, which is impractical for the UE due to the limited HW resources. In this paper, we propose a universal AE framework that can support different input sizes and multiple compression ratios. The proposed AE framework significantly reduces the HW complexity while providing comparable performance in terms of compression ratio-distortion trade-off compared to the naive and state-of-the-art approaches.
- [414] arXiv:2403.00307 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Embedded Multi-label Feature Selection via Orthogonal RegressionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the last decade, embedded multi-label feature selection methods, incorporating the search for feature subsets into model optimization, have attracted considerable attention in accurately evaluating the importance of features in multi-label classification tasks. Nevertheless, the state-of-the-art embedded multi-label feature selection algorithms based on least square regression usually cannot preserve sufficient discriminative information in multi-label data. To tackle the aforementioned challenge, a novel embedded multi-label feature selection method, termed global redundancy and relevance optimization in orthogonal regression (GRROOR), is proposed to facilitate the multi-label feature selection. The method employs orthogonal regression with feature weighting to retain sufficient statistical and structural information related to local label correlations of the multi-label data in the feature learning process. Additionally, both global feature redundancy and global label relevancy information have been considered in the orthogonal regression model, which could contribute to the search for discriminative and non-redundant feature subsets in the multi-label data. The cost function of GRROOR is an unbalanced orthogonal Procrustes problem on the Stiefel manifold. A simple yet effective scheme is utilized to obtain an optimal solution. Extensive experimental results on ten multi-label data sets demonstrate the effectiveness of GRROOR.
- [415] arXiv:2403.00336 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Never-Ending Embodied Robot LearningComments: 14 pages, 5 figures, 8 tablesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Relying on large language models (LLMs), embodied robots could perform complex multimodal robot manipulation tasks from visual observations with powerful generalization ability. However, most visual behavior-cloning agents suffer from manipulation performance degradation and skill knowledge forgetting when adapting into a series of challenging unseen tasks. We here investigate the above challenge with NBCagent in embodied robots, a pioneering language-conditioned Never-ending Behavior-Cloning agent, which can continually learn observation knowledge of novel robot manipulation skills from skill-specific and skill-shared attributes. Specifically, we establish a skill-specific evolving planner to perform knowledge decoupling, which can continually embed novel skill-specific knowledge in our NBCagent agent from latent and low-rank space. Meanwhile, we propose a skill-shared semantics rendering module and a skill-shared representation distillation module to effectively transfer anti-forgetting skill-shared knowledge, further tackling catastrophic forgetting on old skills from semantics and representation aspects. Finally, we design a continual embodied robot manipulation benchmark, and several expensive experiments demonstrate the significant performance of our method. Visual results, code, and dataset are provided at: this https URL .
- [416] arXiv:2403.00353 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MS-Net: A Multi-Path Sparse Model for Motion Prediction in Multi-ScenesComments: Accepted by IEEE Robotics and Automation Letters (RAL)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: The multi-modality and stochastic characteristics of human behavior make motion prediction a highly challenging task, which is critical for autonomous driving. While deep learning approaches have demonstrated their great potential in this area, it still remains unsolved to establish a connection between multiple driving scenes (e.g., merging, roundabout, intersection) and the design of deep learning models. Current learning-based methods typically use one unified model to predict trajectories in different scenarios, which may result in sub-optimal results for one individual scene. To address this issue, we propose Multi-Scenes Network (aka. MS-Net), which is a multi-path sparse model trained by an evolutionary process. MS-Net selectively activates a subset of its parameters during the inference stage to produce prediction results for each scene. In the training stage, the motion prediction task under differentiated scenes is abstracted as a multi-task learning problem, an evolutionary algorithm is designed to encourage the network search of the optimal parameters for each scene while sharing common knowledge between different scenes. Our experiment results show that with substantially reduced parameters, MS-Net outperforms existing state-of-the-art methods on well-established pedestrian motion prediction datasets, e.g., ETH and UCY, and ranks the 2nd place on the INTERACTION challenge.
- [417] arXiv:2403.00376 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Invariant Test-Time Adaptation for Vision-Language Model GeneralizationHuan Ma , Yan Zhu , Changqing Zhang , Peilin Zhao , Baoyuan Wu , Long-Kai Huang , Qinghua Hu , Bingzhe WuSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired datasets. However, these models display significant limitations when applied to long-tail tasks, such as fine-grained image classification, as a result of "decision shortcuts" that hinders their generalization capabilities. In this work, we find that the CLIP model possesses a rich set of features, encompassing both \textit{desired invariant causal features} and \textit{undesired decision shortcuts}. Moreover, the underperformance of CLIP on downstream tasks originates from its inability to effectively utilize pre-trained features in accordance with specific task requirements. To address this challenge, this paper introduces a test-time prompt tuning paradigm that optimizes a learnable prompt, thereby compelling the model to exploit genuine causal invariant features while disregarding decision shortcuts during the inference phase. The proposed method effectively alleviates excessive dependence on potentially misleading, task-irrelevant contextual information, while concurrently emphasizing critical, task-related visual cues. We conduct comparative analysis of the proposed method against various approaches which validates its effectiveness.
- [418] arXiv:2403.00396 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GLFNET: Global-Local (frequency) Filter Networks for efficient medical image segmentationAthanasios Tragakis , Qianying Liu , Chaitanya Kaul , Swalpa Kumar Roy , Hang Dai , Fani Deligianni , Roderick Murray-Smith , Daniele FaccioSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose a novel transformer-style architecture called Global-Local Filter Network (GLFNet) for medical image segmentation and demonstrate its state-of-the-art performance. We replace the self-attention mechanism with a combination of global-local filter blocks to optimize model efficiency. The global filters extract features from the whole feature map whereas the local filters are being adaptively created as 4x4 patches of the same feature map and add restricted scale information. In particular, the feature extraction takes place in the frequency domain rather than the commonly used spatial (image) domain to facilitate faster computations. The fusion of information from both spatial and frequency spaces creates an efficient model with regards to complexity, required data and performance. We test GLFNet on three benchmark datasets achieving state-of-the-art performance on all of them while being almost twice as efficient in terms of GFLOP operations.
- [419] arXiv:2403.00420 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Robust Deep Reinforcement Learning Through Adversarial Attacks and Training : A SurveyLucas Schott , Josephine Delas , Hatem Hajri , Elies Gherbi , Reda Yaich , Nora Boulahia-Cuppens , Frederic Cuppens , Sylvain LamprierComments: 57 pages, 16 figues, 2 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep Reinforcement Learning (DRL) is an approach for training autonomous agents across various complex environments. Despite its significant performance in well known environments, it remains susceptible to minor conditions variations, raising concerns about its reliability in real-world applications. To improve usability, DRL must demonstrate trustworthiness and robustness. A way to improve robustness of DRL to unknown changes in the conditions is through Adversarial Training, by training the agent against well suited adversarial attacks on the dynamics of the environment. Addressing this critical issue, our work presents an in-depth analysis of contemporary adversarial attack methodologies, systematically categorizing them and comparing their objectives and operational mechanisms. This classification offers a detailed insight into how adversarial attacks effectively act for evaluating the resilience of DRL agents, thereby paving the way for enhancing their robustness.
- [420] arXiv:2403.00425 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HALC: Object Hallucination Reduction via Adaptive Focal-Contrast DecodingComments: Code is released at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLMs as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-arts across four benchmarks.
- [421] arXiv:2403.00436 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Abductive Ego-View Accident Video Understanding for Safe Driving PerceptionJianwu Fang , Lei-lei Li , Junfei Zhou , Junbin Xiao , Hongkai Yu , Chen Lv , Jianru Xue , Tat-Seng ChuaComments: Accepted by CVPR2024. This is not the camera-ready version. The Project page: this http URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. MM-AU supports various accident understanding tasks, particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU, we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven by an abductive CLIP model. This model involves a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, accident frames with the corresponding text descriptions, such as accident reasons, prevention advice, and accident categories. OAVD enforces the causal region learning while fixing the content of the original frame background in video generation, to find the dominant cause-effect chain for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models. Additionally, we provide careful benchmark evaluations for object detection and accident reason answering since AdVersa-SD relies on precise object and accident reason information.
- [422] arXiv:2403.00437 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LoMOE: Localized Multi-Object Editing via Multi-DiffusionComments: 18 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: Recent developments in the field of diffusion models have demonstrated an exceptional capacity to generate high-quality prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective when making precise edits to specific objects or fine-grained regions within a scene containing single/multiple objects. We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process to overcome this challenge. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing $\textbf{many}$ objects in a complex scene $\textbf{in one pass}$. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts compared to the current methods. We also curate and release a dataset dedicated to multi-object editing, named $\texttt{LoMOE}$-Bench. Our experiments against existing state-of-the-art methods demonstrate the improved effectiveness of our approach in terms of both image editing quality and inference speed.
- [423] arXiv:2403.00439 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Authors' Values and Attitudes Towards AI-bridged Scalable Personalization of Creative Language ArtsComments: 16 pages, 6 figures, 2 tables. Accepted to ACM CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Generative AI has the potential to create a new form of interactive media: AI-bridged creative language arts (CLA), which bridge the author and audience by personalizing the author's vision to the audience's context and taste at scale. However, it is unclear what the authors' values and attitudes would be regarding AI-bridged CLA. To identify these values and attitudes, we conducted an interview study with 18 authors across eight genres (e.g., poetry, comics) by presenting speculative but realistic AI-bridged CLA scenarios. We identified three benefits derived from the dynamics between author, artifact, and audience: those that 1) authors get from the process, 2) audiences get from the artifact, and 3) authors get from the audience. We found how AI-bridged CLA would either promote or reduce these benefits, along with authors' concerns. We hope our investigation hints at how AI can provide intriguing experiences to CLA audiences while promoting authors' values.
- [424] arXiv:2403.00450 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Parallel Hyperparameter Optimization Of Spiking Neural NetworkSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Spiking Neural Networks (SNN). SNNs are based on a more biologically inspired approach than usual artificial neural networks. Such models are characterized by complex dynamics between neurons and spikes. These are very sensitive to the hyperparameters, making their optimization challenging. To tackle hyperparameter optimization of SNNs, we initially extended the signal loss issue of SNNs to what we call silent networks. These networks fail to emit enough spikes at their outputs due to mistuned hyperparameters or architecture. Generally, search spaces are heavily restrained, sometimes even discretized, to prevent the sampling of such networks. By defining an early stopping criterion detecting silent networks and by designing specific constraints, we were able to instantiate larger and more flexible search spaces. We applied a constrained Bayesian optimization technique, which was asynchronously parallelized, as the evaluation time of a SNN is highly stochastic. Large-scale experiments were carried-out on a multi-GPU Petascale architecture. By leveraging silent networks, results show an acceleration of the search, while maintaining good performances of both the optimization algorithm and the best solution obtained. We were able to apply our methodology to two popular training algorithms, known as spike timing dependent plasticity and surrogate gradient. Early detection allowed us to prevent worthless and costly computation, directing the search toward promising hyperparameter combinations. Our methodology could be applied to multi-objective problems, where the spiking activity is often minimized to reduce the energy consumption. In this scenario, it becomes essential to find the delicate frontier between low-spiking and silent networks. Finally, our approach may have implications for neural architecture search, particularly in defining suitable spiking architectures.
- [425] arXiv:2403.00504 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning and Leveraging World Models in Visual Representation LearningComments: 23 pages, 16 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model. While previously limited to predicting missing parts of an input, we explore how to generalize the JEPA prediction task to a broader set of corruptions. We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space. We study the recipe of learning performant IWMs and show that it relies on three key aspects: conditioning, prediction difficulty, and capacity. Additionally, we show that the predictive world model learned by IWM can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM world model matches or surpasses the performance of previous self-supervised methods. Finally, we show that learning with an IWM allows one to control the abstraction level of the learned representations, learning invariant representations such as contrastive methods, or equivariant representations such as masked image modelling.
- [426] arXiv:2403.00509 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Surveying the Dead Minds: Historical-Psychological Text Analysis with Contextualized Construct Representation (CCR) for Classical ChineseSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: In this work, we develop a pipeline for historical-psychological text analysis in classical Chinese. Humans have produced texts in various languages for thousands of years; however, most of the computational literature is focused on contemporary languages and corpora. The emerging field of historical psychology relies on computational techniques to extract aspects of psychology from historical corpora using new methods developed in natural language processing (NLP). The present pipeline, called Contextualized Construct Representations (CCR), combines expert knowledge in psychometrics (i.e., psychological surveys) with text representations generated via transformer-based language models to measure psychological constructs such as traditionalism, norm strength, and collectivism in classical Chinese corpora. Considering the scarcity of available data, we propose an indirect supervised contrastive learning approach and build the first Chinese historical psychology corpus (C-HI-PSY) to fine-tune pre-trained models. We evaluate the pipeline to demonstrate its superior performance compared with other approaches. The CCR method outperforms word-embedding-based approaches across all of our tasks and exceeds prompting with GPT-4 in most tasks. Finally, we benchmark the pipeline against objective, external data to further verify its validity.
- [427] arXiv:2403.00510 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ROME: Memorization Insights from Text, Probability and Hidden State in Large Language ModelsComments: Submitted to ACL, 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Probing the memorization of large language models holds significant importance. Previous works have established metrics for quantifying memorization, explored various influencing factors, such as data duplication, model size, and prompt length, and evaluated memorization by comparing model outputs with training corpora. However, the training corpora are of enormous scale and its pre-processing is time-consuming. To explore memorization without accessing training data, we propose a novel approach, named ROME, wherein memorization is explored by comparing disparities across memorized and non-memorized. Specifically, models firstly categorize the selected samples into memorized and non-memorized groups, and then comparing the demonstrations in the two groups from the insights of text, probability, and hidden state. Experimental findings show the disparities in factors including word length, part-of-speech, word frequency, mean and variance, just to name a few.
- [428] arXiv:2403.00550 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and BenchmarkingComments: his paper has been accepted in the demonstration track for the 23rd International Conference on Autonomous Agents and Multi-Agent SystemsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Imitation learning field requires expert data to train agents in a task. Most often, this learning approach suffers from the absence of available data, which results in techniques being tested on its dataset. Creating datasets is a cumbersome process requiring researchers to train expert agents from scratch, record their interactions and test each benchmark method with newly created data. Moreover, creating new datasets for each new technique results in a lack of consistency in the evaluation process since each dataset can drastically vary in state and action distribution. In response, this work aims to address these issues by creating Imitation Learning Datasets, a toolkit that allows for: (i) curated expert policies with multithreaded support for faster dataset creation; (ii) readily available datasets and techniques with precise measurements; and (iii) sharing implementations of common imitation learning techniques. Demonstration link: this https URL
- [429] arXiv:2403.00561 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multi-Task Learning Using Uncertainty to Weigh Losses for Heterogeneous Face Attribute EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Face images contain a wide variety of attribute information. In this paper, we propose a generalized framework for joint estimation of ordinal and nominal attributes based on information sharing. We tackle the correlation problem between heterogeneous attributes using hard parameter sharing of shallow features, and trade-off multiple loss functions by considering homoskedastic uncertainty for each attribute estimation task. This leads to optimal estimation of multiple attributes of the face and reduces the training cost of multitask learning. Experimental results on benchmarks with multiple face attributes show that the proposed approach has superior performance compared to state of the art. Finally, we discuss the bias issues arising from the proposed approach in face attribute estimation and validate its feasibility on edge systems.
- [430] arXiv:2403.00564 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: EfficientZero V2: Mastering Discrete and Continuous Control with Limited DataComments: 21 pages,10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Sample efficiency remains a crucial challenge in applying Reinforcement Learning (RL) to real-world tasks. While recent algorithms have made significant strides in improving sample efficiency, none have achieved consistently superior performance across diverse domains. In this paper, we introduce EfficientZero V2, a general framework designed for sample-efficient RL algorithms. We have expanded the performance of EfficientZero to multiple domains, encompassing both continuous and discrete actions, as well as visual and low-dimensional inputs. With a series of improvements we propose, EfficientZero V2 outperforms the current state-of-the-art (SOTA) by a significant margin in diverse tasks under the limited data setting. EfficientZero V2 exhibits a notable advancement over the prevailing general algorithm, DreamerV3, achieving superior outcomes in 50 of 66 evaluated tasks across diverse benchmarks, such as Atari 100k, Proprio Control, and Vision Control.
- [431] arXiv:2403.00565 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Predicting UAV Type: An Exploration of Sampling and Data Augmentation for Time Series ClassificationComments: 12 pages, 3 figures, 4 tables, submitted to IEEE Transactions on CyberneticsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Unmanned aerial vehicles are becoming common and have many productive uses. However, their increased prevalence raises safety concerns -- how can we protect restricted airspace? Knowing the type of unmanned aerial vehicle can go a long way in determining any potential risks it carries. For instance, fixed-wing craft can carry more weight over longer distances, thus potentially posing a more significant threat. This paper presents a machine learning model for classifying unmanned aerial vehicles as quadrotor, hexarotor, or fixed-wing. Our approach effectively applies a Long-Short Term Memory (LSTM) neural network for the purpose of time series classification. We performed experiments to test the effects of changing the timestamp sampling method and addressing the imbalance in the class distribution. Through these experiments, we identified the top-performing sampling and class imbalance fixing methods. Averaging the macro f-scores across 10 folds of data, we found that the majority quadrotor class was predicted well (98.16%), and, despite an extreme class imbalance, the model could also predicted a majority of fixed-wing flights correctly (73.15%). Hexarotor instances were often misclassified as quadrotors due to the similarity of multirotors in general (42.15%). However, results remained relatively stable across certain methods, which prompted us to analyze and report on their tradeoffs. The supplemental material for this paper, including the code and data for running all the experiments and generating the results tables, is available at this https URL .
- [432] arXiv:2403.00567 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Cross-domain few-shot learning (CDFSL) aims to acquire knowledge from limited training data in the target domain by leveraging prior knowledge transferred from source domains with abundant training samples. CDFSL faces challenges in transferring knowledge across dissimilar domains and fine-tuning models with limited training data. To address these challenges, we initially extend the analysis of loss landscapes from the parameter space to the representation space, which allows us to simultaneously interpret the transferring and fine-tuning difficulties of CDFSL models. We observe that sharp minima in the loss landscapes of the representation space result in representations that are hard to transfer and fine-tune. Moreover, existing flatness-based methods have limited generalization ability due to their short-range flatness. To enhance the transferability and facilitate fine-tuning, we introduce a simple yet effective approach to achieve long-range flattening of the minima in the loss landscape. This approach considers representations that are differently normalized as minima in the loss landscape and flattens the high-loss region in the middle by randomly sampling interpolated representations. We implement this method as a new normalization layer that replaces the original one in both CNNs and ViTs. This layer is simple and lightweight, introducing only a minimal number of additional parameters. Experimental results on 8 datasets demonstrate that our approach outperforms state-of-the-art methods in terms of average accuracy. Moreover, our method achieves performance improvements of up to 9\% compared to the current best approaches on individual datasets. Our code will be released.
- [433] arXiv:2403.00570 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking cluster-conditioned diffusion modelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We present a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We elucidate how individual components regarding image clustering impact image synthesis across three datasets. By combining recent advancements from image clustering and diffusion models, we show that, given the optimal cluster granularity with respect to image synthesis (visual groups), cluster-conditioning can achieve state-of-the-art FID (i.e. 1.67, 2.17 on CIFAR10 and CIFAR100 respectively), while attaining a strong training sample efficiency. Finally, we propose a novel method to derive an upper cluster bound that reduces the search space of the visual groups using solely feature-based clustering. Unlike existing approaches, we find no significant connection between clustering and cluster-conditional image generation. The code and cluster assignments will be released.
- [434] arXiv:2403.00587 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived DatasetComments: 12 pages and 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Existing work has observed that current text-to-image systems do not accurately reflect explicit spatial relations between objects such as 'left of' or 'below'. We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models. We propose an automatic method that, given existing images, generates synthetic captions that contain 14 explicit spatial relations. We introduce the Spatial Relation for Generation (SR4G) dataset, which contains 9.9 millions image-caption pairs for training, and more than 60 thousand captions for evaluation. In order to test generalization we also provide an 'unseen' split, where the set of objects in the train and test captions are disjoint. SR4G is the first dataset that can be used to spatially fine-tune text-to-image systems. We show that fine-tuning two different Stable Diffusion models (denoted as SD$_{SR4G}$) yields up to 9 points improvements in the VISOR metric. The improvement holds in the 'unseen' split, showing that SD$_{SR4G}$ is able to generalize to unseen objects. SD$_{SR4G}$ improves the state-of-the-art with fewer parameters, and avoids complex architectures. Our analysis shows that improvement is consistent for all relations. The dataset and the code will be publicly available.
- [435] arXiv:2403.00632 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Metamorpheus: Interactive, Affective, and Creative Dream Narration Through Metaphorical Visual StorytellingComments: Accepted by CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: Human emotions are essentially molded by lived experiences, from which we construct personalised meaning. The engagement in such meaning-making process has been practiced as an intervention in various psychotherapies to promote wellness. Nevertheless, to support recollecting and recounting lived experiences in everyday life remains under explored in HCI. It also remains unknown how technologies such as generative AI models can facilitate the meaning making process, and ultimately support affective mindfulness. In this paper we present Metamorpheus, an affective interface that engages users in a creative visual storytelling of emotional experiences during dreams. Metamorpheus arranges the storyline based on a dream's emotional arc, and provokes self-reflection through the creation of metaphorical images and text depictions. The system provides metaphor suggestions, and generates visual metaphors and text depictions using generative AI models, while users can apply generations to recolour and re-arrange the interface to be visually affective. Our experience-centred evaluation manifests that, by interacting with Metamorpheus, users can recall their dreams in vivid detail, through which they relive and reflect upon their experiences in a meaningful way.
- [436] arXiv:2403.00642 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Rethinking The Uniformity Metric in Self-Supervised LearningJournal-ref: ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Uniformity plays an important role in evaluating learned representations, providing insights into self-supervised learning. In our quest for effective uniformity metrics, we pinpoint four principled properties that such metrics should possess. Namely, an effective uniformity metric should remain invariant to instance permutations and sample replications while accurately capturing feature redundancy and dimensional collapse. Surprisingly, we find that the uniformity metric proposed by \citet{Wang2020UnderstandingCR} fails to satisfy the majority of these properties. Specifically, their metric is sensitive to sample replications, and can not account for feature redundancy and dimensional collapse correctly. To overcome these limitations, we introduce a new uniformity metric based on the Wasserstein distance, which satisfies all the aforementioned properties. Integrating this new metric in existing self-supervised learning methods effectively mitigates dimensional collapse and consistently improves their performance on downstream tasks involving CIFAR-10 and CIFAR-100 datasets. Code is available at \url{ this https URL }.
- [437] arXiv:2403.00691 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Tri-Modal Motion Retrieval by Learning a Joint Embedding SpaceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data especially in online acquirement has led to a surge in human motion research works. Prior works have mainly concentrated on dual-modality learning, such as text and motion tasks, but three-modality learning has been rarely explored. Intuitively, an extra introduced modality can enrich a model's application scenario, and more importantly, an adequate choice of the extra modality can also act as an intermediary and enhance the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning integrating human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially designed attention mechanism to foster enhanced alignment and synergistic effects among text, video, and motion modalities. Empirically, our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance in various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion and motion-to-video.
- [438] arXiv:2403.00692 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Toward Autonomous Cooperation in Heterogeneous Nanosatellite Constellations Using Dynamic Graph Neural NetworksComments: 8 pages, 5 figures, conferenceSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract: The upcoming landscape of Earth Observation missions will defined by networked heterogeneous nanosatellite constellations required to meet strict mission requirements, such as revisit times and spatial resolution. However, scheduling satellite communications in these satellite networks through efficiently creating a global satellite Contact Plan (CP) is a complex task, with current solutions requiring ground-based coordination or being limited by onboard computational resources. The paper proposes a novel approach to overcome these challenges by modeling the constellations and CP as dynamic networks and employing graph-based techniques. The proposed method utilizes a state-of-the-art dynamic graph neural network to evaluate the performance of a given CP and update it using a heuristic algorithm based on simulated annealing. The trained neural network can predict the network delay with a mean absolute error of 3.6 minutes. Simulation results show that the proposed method can successfully design a contact plan for large satellite networks, improving the delay by 29.1%, similar to a traditional approach, while performing the objective evaluations 20x faster.
- [439] arXiv:2403.00694 (cross-list from stat.ML) [ pdf , ps , other ]
-
Title: Defining Expertise: Applications to Treatment Effect EstimationComments: The 12th International Conference on Learning Representations (ICLR 2024)Subjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and "expertise" is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection.
- [440] arXiv:2403.00742 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Dialect prejudice predicts AI decisions about people's character, employability, and criminalitySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Hundreds of millions of people now interact with language models, with uses ranging from serving as a writing aid to informing hiring decisions. Yet these language models are known to perpetuate systematic racial prejudices, making their judgments biased in problematic ways about groups like African Americans. While prior research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time. It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice: we extend research showing that Americans hold raciolinguistic stereotypes about speakers of African American English and find that language models have the same prejudice, exhibiting covert stereotypes that are more negative than any human stereotypes about African Americans ever experimentally recorded, although closest to the ones from before the civil rights movement. By contrast, the language models' overt stereotypes about African Americans are much more positive. We demonstrate that dialect prejudice has the potential for harmful consequences by asking language models to make hypothetical decisions about people, based only on how they speak. Language models are more likely to suggest that speakers of African American English be assigned less prestigious jobs, be convicted of crimes, and be sentenced to death. Finally, we show that existing methods for alleviating racial bias in language models such as human feedback training do not mitigate the dialect prejudice, but can exacerbate the discrepancy between covert and overt stereotypes, by teaching language models to superficially conceal the racism that they maintain on a deeper level. Our findings have far-reaching implications for the fair and safe employment of language technology.
- [441] arXiv:2403.00758 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation TrainingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies showcase that causal LLMs suffer from the "reversal curse". It is a typical example that the model knows "A's father is B", but is unable to reason "B's child is A". This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models' ability to comprehend and apply bidirectional reasoning. In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stage, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation on the training data is considered as a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which addresses this issue by segmenting the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permuting these units before feeding into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works.
- [442] arXiv:2403.00765 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: An Architecture for Unattended Containerized (Deep) Reinforcement Learning with WebotsComments: Latex with llncs.cls, 17 pages, 5 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As data science applications gain adoption across industries, the tooling landscape matures to facilitate the life cycle of such applications and provide solutions to the challenges involved to boost the productivity of the people involved. Reinforcement learning with agents in a 3D world could still face challenges: the knowledge required to use a simulation software as well as the utilization of a standalone simulation software in unattended training pipelines.
In this paper we review tools and approaches to train reinforcement learning agents for robots in 3D worlds with respect to the robot Robotino and argue that the separation of the simulation environment for creators of virtual worlds and the model development environment for data scientists is not a well covered topic. Often both are the same and data scientists require knowledge of the simulation software to work directly with their APIs. Moreover, sometimes creators of virtual worlds and data scientists even work on the same files. We want to contribute to that topic by describing an approach where data scientists don't require knowledge about the simulation software. Our approach uses the standalone simulation software Webots, the Robot Operating System to communicate with simulated robots as well as the simulation software itself and container technology to separate the simulation from the model development environment. We put emphasize on the APIs the data scientists work with and the use of a standalone simulation software in unattended training pipelines. We show the parts that are specific to the Robotino and the robot task to learn. - [443] arXiv:2403.00772 (cross-list from q-fin.ST) [ pdf , ps , html , other ]
-
Title: Do Weibo platform experts perform better at predicting stock market?Journal-ref: 2021, 22nd Engineering Applications of Neural Networks Conference (EANN 2021)Subjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Sentiment analysis can be used for stock market prediction. However, existing research has not studied the impact of a user's financial background on sentiment-based forecasting of the stock market using artificial neural networks. In this work, a novel combination of neural networks is used for the assessment of sentiment-based stock market prediction, based on the financial background of the population that generated the sentiment. The state-of-the-art language processing model Bidirectional Encoder Representations from Transformers (BERT) is used to classify the sentiment and a Long-Short Term Memory (LSTM) model is used for time-series based stock market prediction. For evaluation, the Weibo social networking platform is used as a sentiment data collection source. Weibo users (and their comments respectively) are divided into Authorized Financial Advisor (AFA) and Unauthorized Financial Advisor (UFA) groups according to their background information, as collected by Weibo. The Hong Kong Hang Seng index is used to extract historical stock market change data. The results indicate that stock market prediction learned from the AFA group users is 39.67% more precise than that learned from the UFA group users and shows the highest accuracy (87%) when compared to existing approaches.
- [444] arXiv:2403.00780 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Empirical and Experimental Insights into Data Mining Techniques for Crime Prediction: A Comprehensive SurveySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This survey paper presents a comprehensive analysis of crime prediction methodologies, exploring the various techniques and technologies utilized in this area. The paper covers the statistical methods, machine learning algorithms, and deep learning techniques employed to analyze crime data, while also examining their effectiveness and limitations. We propose a methodological taxonomy that classifies crime prediction algorithms into specific techniques. This taxonomy is structured into four tiers, including methodology category, methodology sub-category, methodology techniques, and methodology sub-techniques. Empirical and experimental evaluations are provided to rank the different techniques. The empirical evaluation assesses the crime prediction techniques based on four criteria, while the experimental evaluation ranks the algorithms that employ the same sub-technique, the different sub-techniques that employ the same technique, the different techniques that employ the same methodology sub-category, the different methodology sub-categories within the same category, and the different methodology categories. The combination of methodological taxonomy, empirical evaluations, and experimental comparisons allows for a nuanced and comprehensive understanding of crime prediction algorithms, aiding researchers in making informed decisions. Finally, the paper provides a glimpse into the future of crime prediction techniques, highlighting potential advancements and opportunities for further research in this field
- [445] arXiv:2403.00781 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented FrameworkZhongqi Yang , Elahe Khatibi , Nitish Nagesh , Mahyar Abbasian , Iman Azimi , Ramesh Jain , Amir M. RahmaniComments: Accepted by The IEEE/ACM international conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) 2024Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: The profound impact of food on health necessitates advanced nutrition-oriented food recommendation services. Conventional methods often lack the crucial elements of personalization, explainability, and interactivity. While Large Language Models (LLMs) bring interpretability and explainability, their standalone use falls short of achieving true personalization. In this paper, we introduce ChatDiet, a novel LLM-powered framework designed specifically for personalized nutrition-oriented food recommendation chatbots. ChatDiet integrates personal and population models, complemented by an orchestrator, to seamlessly retrieve and process pertinent information. The personal model leverages causal discovery and inference techniques to assess personalized nutritional effects for a specific user, whereas the population model provides generalized information on food nutritional content. The orchestrator retrieves, synergizes and delivers the output of both models to the LLM, providing tailored food recommendations designed to support targeted health outcomes. The result is a dynamic delivery of personalized and explainable food recommendations, tailored to individual user preferences. Our evaluation of ChatDiet includes a compelling case study, where we establish a causal personal model to estimate individual nutrition effects. Our assessments, including a food recommendation test showcasing a 92\% effectiveness rate, coupled with illustrative dialogue examples, underscore ChatDiet's strengths in explainability, personalization, and interactivity.
- [446] arXiv:2403.00782 (cross-list from q-fin.ST) [ pdf , ps , html , other ]
-
Title: Ploutos: Towards interpretable stock movement prediction with financial large language modelComments: 8 pages, 4 figuresSubjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent advancements in large language models (LLMs) have opened new pathways for many domains. However, the full potential of LLMs in financial investments remains largely untapped. There are two main challenges for typical deep learning-based methods for quantitative finance. First, they struggle to fuse textual and numerical information flexibly for stock movement prediction. Second, traditional methods lack clarity and interpretability, which impedes their application in scenarios where the justification for predictions is essential. To solve the above challenges, we propose Ploutos, a novel financial LLM framework that consists of PloutosGen and PloutosGPT. The PloutosGen contains multiple primary experts that can analyze different modal data, such as text and numbers, and provide quantitative strategies from different perspectives. Then PloutosGPT combines their insights and predictions and generates interpretable rationales. To generate accurate and faithful rationales, the training strategy of PloutosGPT leverage rearview-mirror prompting mechanism to guide GPT-4 to generate rationales, and a dynamic token weighting mechanism to finetune LLM by increasing key tokens weight. Extensive experiments show our framework outperforms the state-of-the-art methods on both prediction accuracy and interpretability.
- [447] arXiv:2403.00784 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Utilizing BERT for Information Retrieval: Survey, Applications, Resources, and ChallengesJiajia Wang , Jimmy X. Huang , Xinhui Tu , Junmei Wang , Angela J. Huang , Md Tahmid Rahman Laskar , Amran BhuiyanSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent years have witnessed a substantial increase in the use of deep learning to solve various natural language processing (NLP) problems. Early deep learning models were constrained by their sequential or unidirectional nature, such that they struggled to capture the contextual relationships across text inputs. The introduction of bidirectional encoder representations from transformers (BERT) leads to a robust encoder for the transformer model that can understand the broader context and deliver state-of-the-art performance across various NLP tasks. This has inspired researchers and practitioners to apply BERT to practical problems, such as information retrieval (IR). A survey that focuses on a comprehensive analysis of prevalent approaches that apply pretrained transformer encoders like BERT to IR can thus be useful for academia and the industry. In light of this, we revisit a variety of BERT-based methods in this survey, cover a wide range of techniques of IR, and group them into six high-level categories: (i) handling long documents, (ii) integrating semantic information, (iii) balancing effectiveness and efficiency, (iv) predicting the weights of terms, (v) query expansion, and (vi) document expansion. We also provide links to resources, including datasets and toolkits, for BERT-based IR systems. A key highlight of our survey is the comparison between BERT's encoder-based models and the latest generative Large Language Models (LLMs), such as ChatGPT, which rely on decoders. Despite the popularity of LLMs, we find that for specific tasks, finely tuned BERT encoders still outperform, and at a lower deployment cost. Finally, we summarize the comprehensive outcomes of the survey and suggest directions for future research in the area.
- [448] arXiv:2403.00788 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered CareSatvik Tripathi , Liam Mutter , Meghana Muppuri , Suhani Dheer , Emiliano Garza-Frias , Komal Awan , Aakash Jha , Michael Dezube , Azadeh Tabari , Christopher P. Bridge , Dania DayeSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: This study introduces and evaluates the PRECISE framework, utilizing OpenAI's GPT-4 to enhance patient engagement by providing clearer and more accessible chest X-ray reports at a sixth-grade reading level. The framework was tested on 500 reports, demonstrating significant improvements in readability, reliability, and understandability. Statistical analyses confirmed the effectiveness of the PRECISE approach, highlighting its potential to foster patient-centric care delivery in healthcare decision-making.
- [449] arXiv:2403.00790 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Structuring Concept Space with the Musical Circle of Fifths by Utilizing Music Grammar Based ActivationsComments: 3 pagesSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: In this paper, we explore the intriguing similarities between the structure of a discrete neural network, such as a spiking network, and the composition of a piano piece. While both involve nodes or notes that are activated sequentially or in parallel, the latter benefits from the rich body of music theory to guide meaningful combinations. We propose a novel approach that leverages musical grammar to regulate activations in a spiking neural network, allowing for the representation of symbols as attractors. By applying rules for chord progressions from music theory, we demonstrate how certain activations naturally follow others, akin to the concept of attraction. Furthermore, we introduce the concept of modulating keys to navigate different basins of attraction within the network. Ultimately, we show that the map of concepts in our model is structured by the musical circle of fifths, highlighting the potential for leveraging music theory principles in deep learning algorithms.
- [450] arXiv:2403.00791 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: $\textit{L+M-24}$: Building a Dataset for Language + Molecules @ ACL 2024Comments: The dataset, finetuned baselines, and evaluation code are released publicly at this https URL through this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Abstract: Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
- [451] arXiv:2403.00794 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language ModelsZachary Horvitz , Jingru Chen , Rahul Aditya , Harshvardhan Srivastava , Robert West , Zhou Yu , Kathleen McKeownSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs), can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to `unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.
- [452] arXiv:2403.00795 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Executing Natural Language-Described Algorithms with Large Language Models: An InvestigationComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Executing computer programs described in natural language has long been a pursuit of computer science. With the advent of enhanced natural language understanding capabilities exhibited by large language models (LLMs), the path toward this goal has been illuminated. In this paper, we seek to examine the capacity of present-day LLMs to comprehend and execute algorithms outlined in natural language. We established an algorithm test set sourced from Introduction to Algorithm, a well-known textbook that contains many representative widely-used algorithms. To systematically assess LLMs' code execution abilities, we selected 30 algorithms, generated 300 random-sampled instances in total, and evaluated whether popular LLMs can understand and execute these algorithms. Our findings reveal that LLMs, notably GPT-4, can effectively execute programs described in natural language, as long as no heavy numeric computation is involved. We believe our findings contribute to evaluating LLMs' code execution abilities and would encourage further investigation and application for the computation power of LLMs.
- [453] arXiv:2403.00796 (cross-list from q-fin.ST) [ pdf , ps , other ]
-
Title: Enhancing Mean-Reverting Time Series Prediction with Gaussian Processes: Functional and Augmented Data Structures in Financial ForecastingSubjects: Statistical Finance (q-fin.ST) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: In this paper, we explore the application of Gaussian Processes (GPs) for predicting mean-reverting time series with an underlying structure, using relatively unexplored functional and augmented data structures. While many conventional forecasting methods concentrate on the short-term dynamics of time series data, GPs offer the potential to forecast not just the average prediction but the entire probability distribution over a future trajectory. This is particularly beneficial in financial contexts, where accurate predictions alone may not suffice if incorrect volatility assessments lead to capital losses. Moreover, in trade selection, GPs allow for the forecasting of multiple Sharpe ratios adjusted for transaction costs, aiding in decision-making. The functional data representation utilized in this study enables longer-term predictions by leveraging information from previous years, even as the forecast moves away from the current year's training data. Additionally, the augmented representation enriches the training set by incorporating multiple targets for future points in time, facilitating long-term predictions. Our implementation closely aligns with the methodology outlined in, which assessed effectiveness on commodity futures. However, our testing methodology differs. Instead of real data, we employ simulated data with similar characteristics. We construct a testing environment to evaluate both data representations and models under conditions of increasing noise, fat tails, and inappropriate kernels-conditions commonly encountered in practice. By simulating data, we can compare our forecast distribution over time against a full simulation of the actual distribution of our test set, thereby reducing the inherent uncertainty in testing time series models on real data. We enable feature prediction through augmentation and employ sub-sampling to ensure the feasibility of GPs.
- [454] arXiv:2403.00799 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: An Empirical Study of Data Ability Boundary in LLMs' Math ReasoningComments: 33 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) are displaying emergent abilities for math reasoning tasks,and there is a growing attention on enhancing the ability of open-source LLMs through supervised fine-tuning (SFT).In this paper, we aim to explore a general data strategy for supervised data to help optimize and expand math reasoning ability.Firstly, we determine the ability boundary of reasoning paths augmentation by identifying these paths' minimal optimal set.Secondly, we validate that different abilities of the model can be cumulatively enhanced by Mix of Minimal Optimal Sets of corresponding types of data, while our models MMOS achieve SOTA performance on series base models under much lower construction costs.Besides, we point out GSM-HARD is not really hard and today's LLMs no longer lack numerical robustness.Also, we provide an Auto Problem Generator for robustness testing and educational applications.Our code and data are publicly available at this https URL .
- [455] arXiv:2403.00800 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought ProcessesComments: 12 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Although large language models demonstrate emergent abilities in solving math word problems, there is a challenging task in complex multi-step mathematical reasoning tasks. To improve model performance on mathematical reasoning tasks, previous work has conducted supervised fine-tuning on open-source models by improving the quality and quantity of data. In this paper, we propose a novel approach, named Brain, to imitate human thought processes to enhance mathematical reasoning abilities, using the Frontal Lobe Model to generate plans, and then employing the Parietal Lobe Model to generate code and execute to obtain answers. First, we achieve SOTA performance in comparison with Code LLaMA 7B based models through this method. Secondly, we find that plans can be explicitly extracted from natural language, code, or formal language. Our code and data are publicly available at this https URL .
- [456] arXiv:2403.00801 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Self-Retrieval: Building an Information Retrieval System with One Large Language ModelQiaoyu Tang , Jiawei Chen , Bowen Yu , Yaojie Lu , Cheng Fu , Haiyang Yu , Hongyu Lin , Fei Huang , Ben He , Xianpei Han , Le Sun , Yongbin LiSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The rise of large language models (LLMs) has transformed the role of information retrieval (IR) systems in the way to humans accessing information. Due to the isolated architecture and the limited interaction, existing IR systems are unable to fully accommodate the shift from directly providing information to humans to indirectly serving large language models. In this paper, we propose Self-Retrieval, an end-to-end, LLM-driven information retrieval architecture that can fully internalize the required abilities of IR systems into a single LLM and deeply leverage the capabilities of LLMs during IR process. Specifically, Self-retrieval internalizes the corpus to retrieve into a LLM via a natural language indexing architecture. Then the entire retrieval process is redefined as a procedure of document generation and self-assessment, which can be end-to-end executed using a single large language model. Experimental results demonstrate that Self-Retrieval not only significantly outperforms previous retrieval approaches by a large margin, but also can significantly boost the performance of LLM-driven downstream applications like retrieval augumented generation.
- [457] arXiv:2403.00802 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Towards a Theoretical Understanding of Two-Stage Recommender SystemsComments: 18 pages (including references and appendix), 1 figure, 2 tablesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Production-grade recommender systems rely heavily on a large-scale corpus used by online media services, including Netflix, Pinterest, and Amazon. These systems enrich recommendations by learning users' and items' embeddings projected in a low-dimensional space with two-stage models (two deep neural networks), which facilitate their embedding constructs to predict users' feedback associated with items. Despite its popularity for recommendations, its theoretical behaviors remain comprehensively unexplored. We study the asymptotic behaviors of the two-stage recommender that entail a strong convergence to the optimal recommender system. We establish certain theoretical properties and statistical assurance of the two-stage recommender. In addition to asymptotic behaviors, we demonstrate that the two-stage recommender system attains faster convergence by relying on the intrinsic dimensions of the input features. Finally, we show numerically that the two-stage recommender enables encapsulating the impacts of items' and users' attributes on ratings, resulting in better performance compared to existing methods conducted using synthetic and real-world data experiments.
- [458] arXiv:2403.00803 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: LiMAML: Personalization of Deep Recommender Models via Meta LearningRuofan Wang , Prakruthi Prabhakar , Gaurav Srivastava , Tianqi Wang , Zeinab S. Jalali , Varun Bharill , Yunbo Ouyang , Aastha Nigam , Divya Venugopalan , Aman Gupta , Fedor Borisyuk , Sathiya Keerthi , Ajith MuralidharanSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the realm of recommender systems, the ubiquitous adoption of deep neural networks has emerged as a dominant paradigm for modeling diverse business objectives. As user bases continue to expand, the necessity of personalization and frequent model updates have assumed paramount significance to ensure the delivery of relevant and refreshed experiences to a diverse array of members. In this work, we introduce an innovative meta-learning solution tailored to the personalization of models for individual members and other entities, coupled with the frequent updates based on the latest user interaction signals. Specifically, we leverage the Model-Agnostic Meta Learning (MAML) algorithm to adapt per-task sub-networks using recent user interaction data. Given the near infeasibility of productionizing original MAML-based models in online recommendation systems, we propose an efficient strategy to operationalize meta-learned sub-networks in production, which involves transforming them into fixed-sized vectors, termed meta embeddings, thereby enabling the seamless deployment of models with hundreds of billions of parameters for online serving. Through extensive experimentation on production data drawn from various applications at LinkedIn, we demonstrate that the proposed solution consistently outperforms the baseline models of those applications, including strong baselines such as using wide-and-deep ID based personalization approach. Our approach has enabled the deployment of a range of highly personalized AI models across diverse LinkedIn applications, leading to substantial improvements in business metrics as well as refreshed experience for our members.
- [459] arXiv:2403.00804 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Uncovering Customer Issues through Topological Natural Language AnalysisComments: Accepted in KDD 2023 Workshop on Decision Intelligence and Analytics for Online MarketplacesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: E-commerce companies deal with a high volume of customer service requests daily. While a simple annotation system is often used to summarize the topics of customer contacts, thoroughly exploring each specific issue can be challenging. This presents a critical concern, especially during an emerging outbreak where companies must quickly identify and address specific issues. To tackle this challenge, we propose a novel machine learning algorithm that leverages natural language techniques and topological data analysis to monitor emerging and trending customer issues. Our approach involves an end-to-end deep learning framework that simultaneously tags the primary question sentence of each customer's transcript and generates sentence embedding vectors. We then whiten the embedding vectors and use them to construct an undirected graph. From there, we define trending and emerging issues based on the topological properties of each transcript. We have validated our results through various methods and found that they are highly consistent with news sources.
- [460] arXiv:2403.00808 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IPED: An Implicit Perspective for Relational Triple Extraction based on Diffusion ModelComments: 12 pages, 4 figures, committed to NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Relational triple extraction is a fundamental task in the field of information extraction, and a promising framework based on table filling has recently gained attention as a potential baseline for entity relation extraction. However, inherent shortcomings such as redundant information and incomplete triple recognition remain problematic. To address these challenges, we propose an Implicit Perspective for relational triple Extraction based on Diffusion model (IPED), an innovative approach for extracting relational triples. Our classifier-free solution adopts an implicit strategy using block coverage to complete the tables, avoiding the limitations of explicit tagging methods. Additionally, we introduce a generative model structure, the block-denoising diffusion model, to collaborate with our implicit perspective and effectively circumvent redundant information disruptions. Experimental results on two popular datasets demonstrate that IPED achieves state-of-the-art performance while gaining superior inference speed and low computational complexity. To support future research, we have made our source code publicly available online.
- [461] arXiv:2403.00809 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Abdelhak at SemEval-2024 Task 9 : Decoding Brainteasers, The Efficacy of Dedicated Models Versus ChatGPTSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This study introduces a dedicated model aimed at solving the BRAINTEASER task 9 , a novel challenge designed to assess models lateral thinking capabilities through sentence and word puzzles. Our model demonstrates remarkable efficacy, securing Rank 1 in sentence puzzle solving during the test phase with an overall score of 0.98. Additionally, we explore the comparative performance of ChatGPT, specifically analyzing how variations in temperature settings affect its ability to engage in lateral thinking and problem-solving. Our findings indicate a notable performance disparity between the dedicated model and ChatGPT, underscoring the potential of specialized approaches in enhancing creative reasoning in AI.
- [462] arXiv:2403.00812 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LoRA Meets Dropout under a Unified FrameworkSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With the remarkable capabilities, large language models (LLMs) have emerged as essential elements in numerous NLP applications, while parameter-efficient finetuning, especially LoRA, has gained popularity as a lightweight approach for model customization. Meanwhile, various dropout methods, initially designed for full finetuning with all the parameters updated, alleviates overfitting associated with excessive parameter redundancy. Hence, a possible contradiction arises from negligible trainable parameters of LoRA and the effectiveness of previous dropout methods, which has been largely overlooked. To fill this gap, we first confirm that parameter-efficient LoRA is also overfitting-prone. We then revisit transformer-specific dropout methods, and establish their equivalence and distinctions mathematically and empirically. Building upon this comparative analysis, we introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern and compensation measure. Through this framework, we reveal the new preferences and performance comparisons of them when involved with limited trainable parameters. This framework also allows us to amalgamate the most favorable aspects into a novel dropout method named HiddenKey. Extensive experiments verify the remarkable superiority and sufficiency of HiddenKey across multiple models and tasks, which highlights it as the preferred approach for high-performance and parameter-efficient finetuning of LLMs.
- [463] arXiv:2403.00813 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: UrbanGPT: Spatio-Temporal Large Language ModelsComments: 11 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Spatio-temporal prediction aims to forecast and gain insights into the ever-changing dynamics of urban environments across both time and space. Its purpose is to anticipate future patterns, trends, and events in diverse facets of urban life, including transportation, population movement, and crime rates. Although numerous efforts have been dedicated to developing neural network techniques for accurate predictions on spatio-temporal data, it is important to note that many of these methods heavily depend on having sufficient labeled data to generate precise spatio-temporal representations. Unfortunately, the issue of data scarcity is pervasive in practical urban sensing scenarios. Consequently, it becomes necessary to build a spatio-temporal model with strong generalization capabilities across diverse spatio-temporal learning scenarios. Taking inspiration from the remarkable achievements of large language models (LLMs), our objective is to create a spatio-temporal LLM that can exhibit exceptional generalization capabilities across a wide range of downstream urban tasks. To achieve this objective, we present the UrbanGPT, which seamlessly integrates a spatio-temporal dependency encoder with the instruction-tuning paradigm. This integration enables LLMs to comprehend the complex inter-dependencies across time and space, facilitating more comprehensive and accurate predictions under data scarcity. To validate the effectiveness of our approach, we conduct extensive experiments on various public datasets, covering different spatio-temporal prediction tasks. The results consistently demonstrate that our UrbanGPT, with its carefully designed architecture, consistently outperforms state-of-the-art baselines. These findings highlight the potential of building large language models for spatio-temporal learning, particularly in zero-shot scenarios where labeled data is scarce.
- [464] arXiv:2403.00815 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RAM-EHR: Retrieval Augmentation Meets Clinical Predictions on Electronic Health RecordsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Other Quantitative Biology (q-bio.OT)
Abstract: We present RAM-EHR, a Retrieval AugMentation pipeline to improve clinical predictions on Electronic Health Records (EHRs). RAM-EHR first collects multiple knowledge sources, converts them into text format, and uses dense retrieval to obtain information related to medical concepts. This strategy addresses the difficulties associated with complex names for the concepts. RAM-EHR then augments the local EHR predictive model co-trained with consistency regularization to capture complementary information from patient visits and summarized knowledge. Experiments on two EHR datasets show the efficacy of RAM-EHR over previous knowledge-enhanced baselines (3.4% gain in AUROC and 7.2% gain in AUPR), emphasizing the effectiveness of the summarized knowledge from RAM-EHR for clinical prediction tasks. The code will be published at \url{ this https URL }.
- [465] arXiv:2403.00816 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: CFRet-DVQA: Coarse-to-Fine Retrieval and Efficient Tuning for Document Visual Question AnsweringSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Document Visual Question Answering (DVQA) is a task that involves responding to queries based on the content of images. Existing work is limited to locating information within a single page and does not facilitate cross-page question-and-answer interaction. Furthermore, the token length limitation imposed on inputs to the model may lead to truncation of segments pertinent to the answer. In this study, we introduce a simple but effective methodology called CFRet-DVQA, which focuses on retrieval and efficient tuning to address this critical issue effectively. For that, we initially retrieve multiple segments from the document that correlate with the question at hand. Subsequently, we leverage the advanced reasoning abilities of the large language model (LLM), further augmenting its performance through instruction tuning. This approach enables the generation of answers that align with the style of the document labels. The experiments demonstrate that our methodology achieved state-of-the-art or competitive results with both single-page and multi-page documents in various fields.
- [466] arXiv:2403.00822 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: InteraRec: Interactive Recommendations Using Multimodal Large Language ModelsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Weblogs, comprised of records detailing user activities on any website, offer valuable insights into user preferences, behavior, and interests. Numerous recommendation algorithms, employing strategies such as collaborative filtering, content-based filtering, and hybrid methods, leverage the data mined through these weblogs to provide personalized recommendations to users. Despite the abundance of information available in these weblogs, identifying and extracting pertinent information and key features necessitates extensive engineering endeavors. The intricate nature of the data also poses a challenge for interpretation, especially for non-experts. In this study, we introduce a sophisticated and interactive recommendation framework denoted as InteraRec, which diverges from conventional approaches that exclusively depend on weblogs for recommendation generation. This framework captures high-frequency screenshots of web pages as users navigate through a website. Leveraging state-of-the-art multimodal large language models (MLLMs), it extracts valuable insights into user preferences from these screenshots by generating a user behavioral summary based on predefined keywords. Subsequently, this summary is utilized as input to an LLM-integrated optimization setup to generate tailored recommendations. Through our experiments, we demonstrate the effectiveness of InteraRec in providing users with valuable and personalized offerings.
- [467] arXiv:2403.00824 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Information Flow Routes: Automatically Interpreting Language Models at ScaleSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Information flows by routes inside the network via mechanisms implemented in the model. These routes can be represented as graphs where nodes correspond to token representations and edges to operations inside the network. We automatically build these graphs in a top-down manner, for each prediction leaving only the most important nodes and edges. In contrast to the existing workflows relying on activation patching, we do this through attribution: this allows us to efficiently uncover existing circuits with just a single forward pass. Additionally, the applicability of our method is far beyond patching: we do not need a human to carefully design prediction templates, and we can extract information flow routes for any prediction (not just the ones among the allowed templates). As a result, we can talk about model behavior in general, for specific types of predictions, or different domains. We experiment with Llama 2 and show that the role of some attention heads is overall important, e.g. previous token heads and subword merging heads. Next, we find similarities in Llama 2 behavior when handling tokens of the same part of speech. Finally, we show that some model components can be specialized on domains such as coding or multilingual texts.
- [468] arXiv:2403.00827 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Self-Refinement of Language Models from External Proxy Metrics FeedbackKeshav Ramji , Young-Suk Lee , Ramón Fernandez Astudillo , Md Arafat Sultan , Tahira Naseem , Asim Munawar , Radu Florian , Salim RoukosSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: It is often desirable for Large Language Models (LLMs) to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics, and iteratively refines its response one principle at a time. We apply ProMiSe to open source language models Flan-T5-XXL and Llama-2-13B-Chat, to evaluate its performance on document-grounded question answering datasets, MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over the zero-shot baseline as well as a supervised fine-tuned model on human annotated data.
- [469] arXiv:2403.00828 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Deep Learning Detection Method for Large Language Models-Generated Scientific ContentSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs), such as GPT-3 and BERT, reshape how textual content is written and communicated. These models have the potential to generate scientific content that is indistinguishable from that written by humans. Hence, LLMs carry severe consequences for the scientific community, which relies on the integrity and reliability of publications. This research paper presents a novel ChatGPT-generated scientific text detection method, AI-Catcher. AI-Catcher integrates two deep learning models, multilayer perceptron (MLP) and convolutional neural networks (CNN). The MLP learns the feature representations of the linguistic and statistical features. The CNN extracts high-level representations of the sequential patterns from the textual content. AI-Catcher is a multimodal model that fuses hidden patterns derived from MLP and CNN. In addition, a new ChatGPT-Generated scientific text dataset is collected to enhance AI-generated text detection tools, AIGTxt. AIGTxt contains 3000 records collected from published academic articles across ten domains and divided into three classes: Human-written, ChatGPT-generated, and Mixed text. Several experiments are conducted to evaluate the performance of AI-Catcher. The comparative results demonstrate the capability of AI-Catcher to distinguish between human-written and ChatGPT-generated scientific text more accurately than alternative methods. On average, AI-Catcher improved accuracy by 37.4%.
- [470] arXiv:2403.00832 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Explainable Session-based Recommendation via Path ReasoningSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores providing explainability for session-based recommendation (SR) by path reasoning. Current SR models emphasize accuracy but lack explainability, while traditional path reasoning prioritizes knowledge graph exploration, ignoring sequential patterns present in the session history. Therefore, we propose a generalized hierarchical reinforcement learning framework for SR, which improves the explainability of existing SR models via Path Reasoning, namely PR4SR. Considering the different importance of items to the session, we design the session-level agent to select the items in the session as the starting point for path reasoning and the path-level agent to perform path reasoning. In particular, we design a multi-target reward mechanism to adapt to the skip behaviors of sequential patterns in SR, and introduce path midpoint reward to enhance the exploration efficiency in knowledge graphs. To improve the completeness of the knowledge graph and to diversify the paths of explanation, we incorporate extracted feature information from images into the knowledge graph. We instantiate PR4SR in five state-of-the-art SR models (i.e., GRU4REC, NARM, GCSAN, SR-GNN, SASRec) and compare it with other explainable SR frameworks, to demonstrate the effectiveness of PR4SR for recommendation and explanation tasks through extensive experiments with these approaches on four datasets.
- [471] arXiv:2403.00834 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Virtual Reality for Understanding Artificial-Intelligence-driven Scientific Discovery with an Application in Quantum OpticsComments: 12 pages, 6 figures, comments welcomeSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Quantum Physics (quant-ph)
Abstract: Generative Artificial Intelligence (AI) models can propose solutions to scientific problems beyond human capability. To truly make conceptual contributions, researchers need to be capable of understanding the AI-generated structures and extracting the underlying concepts and ideas. When algorithms provide little explanatory reasoning alongside the output, scientists have to reverse-engineer the fundamental insights behind proposals based solely on examples. This task can be challenging as the output is often highly complex and thus not immediately accessible to humans. In this work we show how transferring part of the analysis process into an immersive Virtual Reality (VR) environment can assist researchers in developing an understanding of AI-generated solutions. We demonstrate the usefulness of VR in finding interpretable configurations of abstract graphs, representing Quantum Optics experiments. Thereby, we can manually discover new generalizations of AI-discoveries as well as new understanding in experimental quantum optics. Furthermore, it allows us to customize the search space in an informed way - as a human-in-the-loop - to achieve significantly faster subsequent discovery iterations. As concrete examples, with this technology, we discover a new resource-efficient 3-dimensional entanglement swapping scheme, as well as a 3-dimensional 4-particle Greenberger-Horne-Zeilinger-state analyzer. Our results show the potential of VR for increasing a human researcher's ability to derive knowledge from graph-based generative AI that, which is a common abstract data representation used in diverse fields of science.
- [472] arXiv:2403.00835 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CLLMs: Consistency Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference as it breaks the sequential nature of the LLM decoding process and transforms it into parallelizable computation. However, in practice, it achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because Jacobi decoding seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4$\times$ to 3.4$\times$ improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks.
- [473] arXiv:2403.00840 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: EyeGPT: Ophthalmic Assistant with Large Language ModelsXiaolan Chen , Ziwei Zhao , Weiyi Zhang , Pusheng Xu , Le Gao , Mingpu Xu , Yue Wu , Yinwen Li , Danli Shi , Mingguang HeComments: 47 pages, 4 figures, 1 table, 2 supplementary figures and 9 supplementary tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Artificial intelligence (AI) has gained significant attention in healthcare consultation due to its potential to improve clinical workflow and enhance medical communication. However, owing to the complex nature of medical information, large language models (LLM) trained with general world knowledge might not possess the capability to tackle medical-related tasks at an expert level. Here, we introduce EyeGPT, a specialized LLM designed specifically for ophthalmology, using three optimization strategies including role-playing, finetuning, and retrieval-augmented generation. In particular, we proposed a comprehensive evaluation framework that encompasses a diverse dataset, covering various subspecialties of ophthalmology, different users, and diverse inquiry intents. Moreover, we considered multiple evaluation metrics, including accuracy, understandability, trustworthiness, empathy, and the proportion of hallucinations. By assessing the performance of different EyeGPT variants, we identify the most effective one, which exhibits comparable levels of understandability, trustworthiness, and empathy to human ophthalmologists (all Ps>0.05). Overall, ur study provides valuable insights for future research, facilitating comprehensive comparisons and evaluations of different strategies for developing specialized LLMs in ophthalmology. The potential benefits include enhancing the patient experience in eye care and optimizing ophthalmologists' services.
- [474] arXiv:2403.00841 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Offline Fictitious Self-Play for Competitive GamesSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract: Offline Reinforcement Learning (RL) has received significant interest due to its ability to improve policies in previously collected datasets without online interactions. Despite its success in the single-agent setting, offline multi-agent RL remains a challenge, especially in competitive games. Firstly, unaware of the game structure, it is impossible to interact with the opponents and conduct a major learning paradigm, self-play, for competitive games. Secondly, real-world datasets cannot cover all the state and action space in the game, resulting in barriers to identifying Nash equilibrium (NE). To address these issues, this paper introduces Off-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn best responses to different opponents and employ the Offline Self-Play learning framework. In this framework, we further implement Fictitious Self-Play (FSP) to approximate NE. In partially covered real-world datasets, our methods show the potential to approach NE by incorporating any single-agent offline RL method. Experimental results in Leduc Hold'em Poker show that our method significantly improves performances compared with state-of-the-art baselines.
- [475] arXiv:2403.00843 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Large Language Models are Learnable Planners for Long-Term RecommendationWentao Shi , Xiangnan He , Yang Zhang , Chongming Gao , Xinyue Li , Jizhi Zhang , Qifan Wang , Fuli FengComments: 11 pages, 5 figuresSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Planning for both immediate and long-term benefits becomes increasingly important in recommendation. Existing methods apply Reinforcement Learning (RL) to learn planning capacity by maximizing cumulative reward for long-term recommendation. However, the scarcity of recommendation data presents challenges such as instability and susceptibility to overfitting when training RL models from scratch, resulting in sub-optimal performance. In this light, we propose to leverage the remarkable planning capabilities over sparse data of Large Language Models (LLMs) for long-term recommendation. The key to achieving the target lies in formulating a guidance plan following principles of enhancing long-term engagement and grounding the plan to effective and executable actions in a personalized manner. To this end, we propose a Bi-level Learnable LLM Planner framework, which consists of a set of LLM instances and breaks down the learning process into macro-learning and micro-learning to learn macro-level guidance and micro-level personalized recommendation policies, respectively. Extensive experiments validate that the framework facilitates the planning ability of LLMs for long-term recommendation. Our code and data can be found at this https URL .
- [476] arXiv:2403.00854 (cross-list from q-bio.NC) [ pdf , ps , html , other ]
-
Title: Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task LearningComments: 17 pages, 2 tables, 4 main figures, 2 supplemental figures, prepared for journal submissionSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Dysarthria, a condition resulting from impaired control of the speech muscles due to neurological disorders, significantly impacts the communication and quality of life of patients. The condition's complexity, human scoring and varied presentations make its assessment and management challenging. This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data. It can offer an objective, repeatable, accessible, standardised and cost-effective and compared to traditional methods requiring human expert assessors. We develop a transformer framework, called Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification. The multi-task framework is designed to reduce reliance on speaker-specific characteristics and address the intrinsic intra-class variability of dysarthric speech. We evaluated on the Universal Access Speech dataset using leave-one-speaker-out cross-validation, our model demonstrated superior performance over traditional machine learning approaches, with an accuracy of $70.48\%$ and an F1 score of $59.23\%$. Our SALR model also exceeded the previous benchmark for AI-based classification, which used support vector machines, by $16.58\%$. We open the black box of our model by visualising the latent space where we can observe how the model substantially reduces speaker-specific cues and amplifies task-specific ones, thereby showing its robustness. In conclusion, SALR establishes a new benchmark in speaker-independent multi-class dysarthria severity classification using generative AI. The potential implications of our findings for broader clinical applications in automated dysarthria severity assessments.
- [477] arXiv:2403.00858 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMsComments: 8 pages, 3 figures, Published at the ICLR 2024 Workshop on Understanding of Foundation Models (ME-FoMo)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Text generation with Large Language Models (LLMs) is known to be memory bound due to the combination of their auto-regressive nature, huge parameter counts, and limited memory bandwidths, often resulting in low token rates. Speculative decoding has been proposed as a solution for LLM inference acceleration. However, since draft models are often unavailable in the modern open-source LLM families, e.g., for Llama 2 7B, training a high-quality draft model is required to enable inference acceleration via speculative decoding. In this paper, we propose a simple draft model training framework for direct alignment to chat-capable target models. With the proposed framework, we train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, with only 1.64\% of the original size. Our training framework only consists of pretraining, distillation dataset generation, and finetuning with knowledge distillation, with no additional alignment procedure. For the finetuning step, we use instruction-response pairs generated by target model for distillation in plausible data distribution, and propose a new Total Variation Distance++ (TVD++) loss that incorporates variance reduction techniques inspired from the policy gradient method in reinforcement learning. Our empirical results show that Llama 2 Chat Drafter 115M with speculative decoding achieves up to 2.3 block efficiency and 2.4$\times$ speed-up relative to autoregressive decoding on various tasks with no further task-specific fine-tuning.
- [478] arXiv:2403.00860 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Parallel Algorithms for Exact Enumeration of Deep Neural Network Activation RegionsSabrina Drammis , Bowen Zheng , Karthik Srinivasan , Robert C. Berwick , Nancy A. Lynch , Robert AjemianSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: A feedforward neural network using rectified linear units constructs a mapping from inputs to outputs by partitioning its input space into a set of convex regions where points within a region share a single affine transformation. In order to understand how neural networks work, when and why they fail, and how they compare to biological intelligence, we need to understand the organization and formation of these regions. Step one is to design and implement algorithms for exact region enumeration in networks beyond toy examples.
In this work, we present parallel algorithms for exact enumeration in deep (and shallow) neural networks. Our work has three main contributions: (1) we present a novel algorithm framework and parallel algorithms for region enumeration; (2) we implement one of our algorithms on a variety of network architectures and experimentally show how the number of regions dictates runtime; and (3) we show, using our algorithm's output, how the dimension of a region's affine transformation impacts further partitioning of the region by deeper layers.
To our knowledge, we run our implemented algorithm on networks larger than all of the networks used in the existing region enumeration literature. Further, we experimentally demonstrate the importance of parallelism for region enumeration of any reasonably sized network. - [479] arXiv:2403.00862 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: NewsBench: Systematic Evaluation of LLMs for Writing Proficiency and Safety Adherence in Chinese Journalistic Editorial ApplicationsMiao Li , Ming-Bin Chen , Bo Tang , Shengbin Hou , Pengyu Wang , Haiying Deng , Zhiyu Li , Feiyu Xiong , Keming Mao , Peng Cheng , Yi LuoComments: 27 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This study presents NewsBench, a novel benchmark framework developed to evaluate the capability of Large Language Models (LLMs) in Chinese Journalistic Writing Proficiency (JWP) and their Safety Adherence (SA), addressing the gap between journalistic ethics and the risks associated with AI utilization. Comprising 1,267 tasks across 5 editorial applications, 7 aspects (including safety and journalistic writing with 4 detailed facets), and spanning 24 news topics domains, NewsBench employs two GPT-4 based automatic evaluation protocols validated by human assessment. Our comprehensive analysis of 10 LLMs highlighted GPT-4 and ERNIE Bot as top performers, yet revealed a relative deficiency in journalistic ethic adherence during creative writing tasks. These findings underscore the need for enhanced ethical guidance in AI-generated journalistic content, marking a step forward in aligning AI capabilities with journalistic standards and safety considerations.
- [480] arXiv:2403.00863 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value ExtractionChenhao Fang , Xiaohan Li , Zezhong Fan , Jianpeng Xu , Kaushiki Nag , Evren Korpeoglu , Sushant Kumar , Kannan AchanSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
- [481] arXiv:2403.00865 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Fast and Efficient Local Search for Genetic Programming Based Loss Function LearningComments: arXiv admin note: substantial text overlap with arXiv:2209.08907Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: In this paper, we develop upon the topic of loss function learning, an emergent meta-learning paradigm that aims to learn loss functions that significantly improve the performance of the models trained under them. Specifically, we propose a new meta-learning framework for task and model-agnostic loss function learning via a hybrid search approach. The framework first uses genetic programming to find a set of symbolic loss functions. Second, the set of learned loss functions is subsequently parameterized and optimized via unrolled differentiation. The versatility and performance of the proposed framework are empirically validated on a diverse set of supervised learning tasks. Results show that the learned loss functions bring improved convergence, sample efficiency, and inference performance on tabulated, computer vision, and natural language processing problems, using a variety of task-specific neural network architectures.
- [482] arXiv:2403.00867 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss LandscapesComments: Project page: this https URLSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including functional values and its smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.
- [483] arXiv:2403.00868 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SoftTiger: A Clinical Foundation Model for Healthcare WorkflowsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We introduce SoftTiger, a clinical large language model (CLaM) designed as a foundation model for healthcare workflows. The narrative and unstructured nature of clinical notes is a major obstacle for healthcare intelligentization. We address a critical problem of structuring clinical notes into clinical data, according to international interoperability standards. We collect and annotate data for three subtasks, namely, international patient summary, clinical impression and medical encounter. We then supervised fine-tuned a state-of-the-art LLM using public and credentialed clinical data. The training is orchestrated in a way that the target model can first support basic clinical tasks such as abbreviation expansion and temporal information extraction, and then learn to perform more complex downstream clinical tasks. Moreover, we address several modeling challenges in the healthcare context, e.g., extra long context window. Our blind pairwise evaluation shows that SoftTiger outperforms other popular open-source models and GPT-3.5, comparable to Gemini-pro, with a mild gap from GPT-4. We believe that LLMs may become a step-stone towards healthcare digitalization and democratization. Therefore, we publicly release SoftTiger models at scales of 13 billion and 70 billion parameters, as well as datasets and code for our innovative scalable evaluation, hopefully, making a significant contribution to the healthcare industry.
- [484] arXiv:2403.00871 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Teach LLMs to Phish: Stealing Private Information from Language ModelsComments: ICLR 2024Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: When large language models are trained on private data, it can be a significant privacy risk for them to memorize and regurgitate sensitive information. In this work, we propose a new practical data extraction attack that we call "neural phishing". This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit card numbers, from a model trained on user data with upwards of 10% attack success rates, at times, as high as 50%. Our attack assumes only that an adversary can insert as few as 10s of benign-appearing sentences into the training dataset using only vague priors on the structure of the user data.
- [485] arXiv:2403.00872 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy in Large-Scale DatabasesSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI)
Abstract: The task of converting natural language queries into SQL queries is intricate, necessitating a blend of precise techniques for an accurate translation. The DIN-SQL (Decomposed-In-Context SQL) methodology represents a significant development in this domain. This paper introduces DFIN (Decomposed Focused-In-Context), an innovative extension of DIN-SQL that enhances Text-to-SQL conversion by addressing schema linking errors, which are a major source of inaccuracies. DFIN uniquely alternates between prompting techniques and Retrieval-Augmented Generation (RAG), adapting to the size and complexity of the database schema. A preprocessing phase embeds database definitions and leverages annotated files, akin to those in the BIRD dataset, facilitating the runtime retrieval of pertinent schema information. This strategy significantly reduces the token count for schema linking prompts, enabling the use of a standard GPT-4 model over its larger context variant, thus handling large-scale databases more effectively and economically. Our evaluation on the BIRD dataset, a challenging real-world benchmark, demonstrates that DFIN not only scales efficiently but also improves accuracy, achieving a score of 51.69. This improvement surpasses DIN-SQL method (the current third-place), which is the highest-ranked model employing in-context learning rather than fine-tuning, previously scoring 50.72. The advancement of DFIN underscores the evolving capabilities of in-context learning methodologies combined with advanced language models, offering a promising avenue for future research in complex Text-to-SQL conversion tasks.
- [486] arXiv:2403.00875 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New DirectionsSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Abstract: Augmentation is an effective alternative to utilize the small amount of labeled protein data. However, most of the existing work focuses on design-ing new architectures or pre-training tasks, and relatively little work has studied data augmentation for proteins. This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks, providing the first comprehensive evaluation of protein augmentation. Furthermore, we propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution, which enable protein semantic-aware augmentation through saliency detection and biological knowledge. Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA), which can adaptively select the most suitable augmentation combinations for different tasks. Extensive experiments have shown that APA enhances the performance of five protein related tasks by an average of 10.55% across three architectures compared to vanilla implementations without augmentation, highlighting its potential to make a great impact on the field.
- [487] arXiv:2403.00876 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Word Order and World KnowledgeSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Word order is an important concept in natural language, and in this work, we study how word order affects the induction of world knowledge from raw text using language models. We use word analogies to probe for such knowledge. Specifically, in addition to the natural word order, we first respectively extract texts of six fixed word orders from five languages and then pretrain the language models on these texts. Finally, we analyze the experimental results of the fixed word orders on word analogies and show that i) certain fixed word orders consistently outperform or underperform others, though the specifics vary across languages, and ii) the Wov2Lex hypothesis is not hold in pre-trained language models, and the natural word order typically yields mediocre results. The source code will be made publicly available at this https URL .
- [488] arXiv:2403.00878 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Crimson: Empowering Strategic Reasoning in Cybersecurity through Large Language ModelsJiandong Jin , Bowen Tang , Mingxuan Ma , Xiao Liu , Yunfei Wang , Qingnan Lai , Jia Yang , Changling ZhouComments: 9 pages, 7 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: We introduces Crimson, a system that enhances the strategic reasoning capabilities of Large Language Models (LLMs) within the realm of cybersecurity. By correlating CVEs with MITRE ATT&CK techniques, Crimson advances threat anticipation and strategic defense efforts. Our approach includes defining and evaluating cybersecurity strategic tasks, alongside implementing a comprehensive human-in-the-loop data-synthetic workflow to develop the CVE-to-ATT&CK Mapping (CVEM) dataset. We further enhance LLMs' reasoning abilities through a novel Retrieval-Aware Training (RAT) process and its refined iteration, RAT-R.
Our findings demonstrate that an LLM fine-tuned with our techniques, possessing 7 billion parameters, approaches the performance level of GPT-4, showing markedly lower rates of hallucination and errors, and surpassing other models in strategic reasoning tasks. Moreover, domain-specific fine-tuning of embedding models significantly improves performance within cybersecurity contexts, underscoring the efficacy of our methodology. By leveraging Crimson to convert raw vulnerability data into structured and actionable insights, we bolster proactive cybersecurity defenses. - [489] arXiv:2403.00880 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Dual-Granularity Medication Recommendation Based on Causal InferenceSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: As medical demands grow and machine learning technology advances, AI-based diagnostic and treatment systems are garnering increasing attention. Medication recommendation aims to integrate patients' long-term health records with medical knowledge, recommending accuracy and safe medication combinations for specific conditions. However, most existing researches treat medication recommendation systems merely as variants of traditional recommendation systems, overlooking the heterogeneity between medications and diseases. To address this challenge, we propose DGMed, a framework for medication recommendation. DGMed utilizes causal inference to uncover the connections among medical entities and presents an innovative feature alignment method to tackle heterogeneity issues. Specifically, this study first applies causal inference to analyze the quantified therapeutic effects of medications on specific diseases from historical records, uncovering potential links between medical entities. Subsequently, we integrate molecular-level knowledge, aligning the embeddings of medications and diseases within the molecular space to effectively tackle their heterogeneity. Ultimately, based on relationships at the entity level, we adaptively adjust the recommendation probabilities of medication and recommend medication combinations according to the patient's current health condition. Experimental results on a real-world dataset show that our method surpasses existing state-of-the-art baselines in four evaluation metrics, demonstrating superior performance in both accuracy and safety aspects. Compared to the sub-optimal model, our approach improved accuracy by 4.40%, reduced the risk of side effects by 6.14%, and increased time efficiency by 47.15%.
- [490] arXiv:2403.00884 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Text classification of column headers with a controlled vocabulary: leveraging LLMs for metadata enrichmentSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Traditional dataset retrieval systems index on metadata information rather than on the data values. Thus relying primarily on manual annotations and high-quality metadata, processes known to be labour-intensive and challenging to automate. We propose a method to support metadata enrichment with topic annotations of column headers using three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard and GoogleGemini. We investigate the LLMs ability to classify column headers based on domain-specific topics from a controlled vocabulary. We evaluate our approach by assessing the internal consistency of the LLMs, the inter-machine alignment, and the human-machine agreement for the topic classification task. Additionally, we investigate the impact of contextual information (i.e. dataset description) on the classification outcomes. Our results suggest that ChatGPT and GoogleGemini outperform GoogleBard for internal consistency as well as LLM-human-alignment. Interestingly, we found that context had no impact on the LLMs performances. This work proposes a novel approach that leverages LLMs for text classification using a controlled topic vocabulary, which has the potential to facilitate automated metadata enrichment, thereby enhancing dataset retrieval and the Findability, Accessibility, Interoperability and Reusability (FAIR) of research data on the Web.
- [491] arXiv:2403.00887 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in SpeechAron R , Indra Sigicharla , Chirag Periwal , Mohanaprasad K , Nithya Darisini P S , Sourabh Tiwari , Shivani AroraSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Abstract: The interpretation of human voices holds importance across various applications. This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications. Voice analysis tech advancements span domains, from improving customer interactions to enhancing healthcare and retail experiences. Discerning emotions aids mental health, while age and gender detection are vital in various contexts. Exploring deep learning models for these predictions involves comparing single, multi-output, and sequential models highlighted in this paper. Sourcing suitable data posed challenges, resulting in the amalgamation of the CREMA-D and EMO-DB datasets. Prior work showed promise in individual predictions, but limited research considered all three variables simultaneously. This paper identifies flaws in an individual model approach and advocates for our novel multi-output learning architecture Speech-based Emotion Gender and Age Analysis (SEGAA) model. The experiments suggest that Multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs, all while achieving improved runtime.
- [492] arXiv:2403.00890 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Improving Android Malware Detection Through Data Augmentation Using Wasserstein Generative Adversarial NetworksComments: 20 pagesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Generative Adversarial Networks (GANs) have demonstrated their versatility across various applications, including data augmentation and malware detection. This research explores the effectiveness of utilizing GAN-generated data to train a model for the detection of Android malware. Given the considerable storage requirements of Android applications, the study proposes a method to synthetically represent data using GANs, thereby reducing storage demands. The proposed methodology involves creating image representations of features extracted from an existing dataset. A GAN model is then employed to generate a more extensive dataset consisting of realistic synthetic grayscale images. Subsequently, this synthetic dataset is utilized to train a Convolutional Neural Network (CNN) designed to identify previously unseen Android malware applications. The study includes a comparative analysis of the CNN's performance when trained on real images versus synthetic images generated by the GAN. Furthermore, the research explores variations in performance between the Wasserstein Generative Adversarial Network (WGAN) and the Deep Convolutional Generative Adversarial Network (DCGAN). The investigation extends to studying the impact of image size and malware obfuscation on the classification model's effectiveness. The data augmentation approach implemented in this study resulted in a notable performance enhancement of the classification model, ranging from 1.5% to 7%, depending on the dataset. The highest achieved F1 score reached 0.975.
Keywords--Generative Adversarial Networks, Android Malware, Data Augmentation, Wasserstein Generative Adversarial Network - [493] arXiv:2403.00891 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Regularization-based Transfer Learning Method for Information Extraction via Instructed Graph DecoderSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Information extraction (IE) aims to extract complex structured information from the text. Numerous datasets have been constructed for various IE tasks, leading to time-consuming and labor-intensive data annotations. Nevertheless, most prevailing methods focus on training task-specific models, while the common knowledge among different IE tasks is not explicitly modeled. Moreover, the same phrase may have inconsistent labels in different tasks, which poses a big challenge for knowledge transfer using a unified model. In this study, we propose a regularization-based transfer learning method for IE (TIE) via an instructed graph decoder. Specifically, we first construct an instruction pool for datasets from all well-known IE tasks, and then present an instructed graph decoder, which decodes various complex structures into a graph uniformly based on corresponding instructions. In this way, the common knowledge shared with existing datasets can be learned and transferred to a new dataset with new labels. Furthermore, to alleviate the label inconsistency problem among various IE tasks, we introduce a task-specific regularization strategy, which does not update the gradients of two tasks with 'opposite direction'. We conduct extensive experiments on 12 datasets spanning four IE tasks, and the results demonstrate the great advantages of our proposed method
- [494] arXiv:2403.00894 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: A systematic evaluation of large language models for generating programming codeSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Abstract: We systematically evaluated the performance of seven large language models in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development.
- [495] arXiv:2403.00895 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: End-to-End Graph-Sequential Representation Learning for Accurate RecommendationsComments: 4 pages, 1 figure, submitted to WWW'24, short-paper trackSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent recommender system advancements have focused on developing sequence-based and graph-based approaches. Both approaches proved useful in modeling intricate relationships within behavioral data, leading to promising outcomes in personalized ranking and next-item recommendation tasks while maintaining good scalability. However, they capture very different signals from data. While the former approach represents users directly through ordered interactions with recent items, the latter aims to capture indirect dependencies across the interactions graph. This paper presents a novel multi-representational learning framework exploiting these two paradigms' synergies. Our empirical evaluation on several datasets demonstrates that mutual training of sequential and graph components with the proposed framework significantly improves recommendations performance.
- [496] arXiv:2403.00896 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Since large language models (LLMs) achieve significant success in recent years, the hallucination issue remains a challenge, numerous benchmarks are proposed to detect the hallucination. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many merely focus on the factuality hallucination while ignoring the faithfulness hallucination. Additionally, although dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments through some well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.
- [497] arXiv:2403.00897 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: VisRec: A Semi-Supervised Approach to Radio Interferometric Data ReconstructionSubjects: Image and Video Processing (eess.IV) ; Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Radio telescopes produce visibility data about celestial objects, but these data are sparse and noisy. As a result, images created on raw visibility data are of low quality. Recent studies have used deep learning models to reconstruct visibility data to get cleaner images. However, these methods rely on a substantial amount of labeled training data, which requires significant labeling effort from radio astronomers. Addressing this challenge, we propose VisRec, a model-agnostic semi-supervised learning approach to the reconstruction of visibility data. Specifically, VisRec consists of both a supervised learning module and an unsupervised learning module. In the supervised learning module, we introduce a set of data augmentation functions to produce diverse training examples. In comparison, the unsupervised learning module in VisRec augments unlabeled data and uses reconstructions from non-augmented visibility data as pseudo-labels for training. This hybrid approach allows VisRec to effectively leverage both labeled and unlabeled data. This way, VisRec performs well even when labeled data is scarce. Our evaluation results show that VisRec outperforms all baseline methods in reconstruction quality, robustness against common observation perturbation, and generalizability to different telescope configurations.
- [498] arXiv:2403.00929 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: PRIME: Scaffolding Manipulation Tasks with Behavior Primitives for Data-Efficient Imitation LearningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Imitation learning has shown great potential for enabling robots to acquire complex manipulation behaviors. However, these algorithms suffer from high sample complexity in long-horizon tasks, where compounding errors accumulate over the task horizons. We present PRIME (PRimitive-based IMitation with data Efficiency), a behavior primitive-based framework designed for improving the data efficiency of imitation learning. PRIME scaffolds robot tasks by decomposing task demonstrations into primitive sequences, followed by learning a high-level control policy to sequence primitives through imitation learning. Our experiments demonstrate that PRIME achieves a significant performance improvement in multi-stage manipulation tasks, with 10-34% higher success rates in simulation over state-of-the-art baselines and 20-48% on physical hardware.
- [499] arXiv:2403.00930 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Scale-free Adversarial Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper initiates the study of scale-free learning in Markov Decision Processes (MDPs), where the scale of rewards/losses is unknown to the learner. We design a generic algorithmic framework, \underline{S}cale \underline{C}lipping \underline{B}ound (\texttt{SCB}), and instantiate this framework in both the adversarial Multi-armed Bandit (MAB) setting and the adversarial MDP setting. Through this framework, we achieve the first minimax optimal expected regret bound and the first high-probability regret bound in scale-free adversarial MABs, resolving an open problem raised in \cite{hadiji2023adaptation}. On adversarial MDPs, our framework also give birth to the first scale-free RL algorithm with a $\tilde{\mathcal{O}}(\sqrt{T})$ high-probability regret guarantee.
- [500] arXiv:2403.00942 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Resilience of Entropy Model in Distributed Neural NetworksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Distributed deep neural networks (DNNs) have emerged as a key technique to reduce communication overhead without sacrificing performance in edge computing systems. Recently, entropy coding has been introduced to further reduce the communication overhead. The key idea is to train the distributed DNN jointly with an entropy model, which is used as side information during inference time to adaptively encode latent representations into bit streams with variable length. To the best of our knowledge, the resilience of entropy models is yet to be investigated. As such, in this paper we formulate and investigate the resilience of entropy models to intentional interference (e.g., adversarial attacks) and unintentional interference (e.g., weather changes and motion blur). Through an extensive experimental campaign with 3 different DNN architectures, 2 entropy models and 4 rate-distortion trade-off factors, we demonstrate that the entropy attacks can increase the communication overhead by up to 95%. By separating compression features in frequency and spatial domain, we propose a new defense mechanism that can reduce the transmission overhead of the attacked input by about 9% compared to unperturbed data, with only about 2% accuracy loss. Importantly, the proposed defense mechanism is a standalone approach which can be applied in conjunction with approaches such as adversarial training to further improve robustness. Code will be shared for reproducibility.
- [501] arXiv:2403.00953 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: AutoRD: An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Objectives: Our objective is to create an end-to-end system called AutoRD, which automates extracting information from clinical text about rare diseases. We have conducted various tests to evaluate the performance of AutoRD and highlighted its strengths and limitations in this paper.
Materials and Methods: Our system, AutoRD, is a software pipeline involving data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implement this using large language models and medical knowledge graphs developed from open-source medical ontologies. We quantitatively evaluate our system on entity extraction, relation extraction, and the performance of knowledge graph construction.
Results: AutoRD achieves an overall F1 score of 47.3%, a 14.4% improvement compared to the base LLM. In detail, AutoRD achieves an overall entity extraction F1 score of 56.1% (rare_disease: 83.5%, disease: 35.8%, symptom_and_sign: 46.1%, anaphor: 67.5%) and an overall relation extraction F1 score of 38.6% (produces: 34.7%, increases_risk_of: 12.4%, is_a: 37.4%, is_acronym: 44.1%, is_synonym: 16.3%, anaphora: 57.5%). Our qualitative experiment also demonstrates that the performance in constructing the knowledge graph is commendable.
Discussion: AutoRD demonstrates the potential of LLM applications in rare disease detection. This improvement is attributed to several design, including the integration of ontologies-enhanced LLMs.
Conclusion: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs. It uses ontologies-enhanced LLMs for a robust medical knowledge base. The superior performance of AutoRD is validated by experimental evaluations, demonstrating the potential of LLMs in healthcare. - [502] arXiv:2403.00957 (cross-list from stat.ME) [ pdf , ps , html , other ]
-
Title: Resolution of Simpson's paradox via the common cause principleSubjects: Methodology (stat.ME) ; Artificial Intelligence (cs.AI); Probability (math.PR); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
Abstract: Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given the third (lurking) random variable $B$. We focus on scenarios when the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. For such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This set-up generalizes the original Simpson's paradox. Now its two contradicting options simply refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and the most widespread situation for valid Simpson's paradox), the conditioning over any binary common cause $C$ establishes the same direction of the association between $a_1$ and $a_2$ as the conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ and not its marginalization. For tertiary (unobserved) common causes $C$ all three options of Simpson's paradox become possible (i.e. marginalized, conditional, and none of them), and one needs prior information on $C$ to choose the right option.
- [503] arXiv:2403.00965 (cross-list from stat.AP) [ pdf , ps , other ]
-
Title: Binary Gaussian Copula Synthesis: A Novel Data Augmentation Technique to Advance ML-based Clinical Decision Support Systems for Early Prediction of Dialysis Among CKD PatientsSubjects: Applications (stat.AP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The Center for Disease Control estimates that over 37 million US adults suffer from chronic kidney disease (CKD), yet 9 out of 10 of these individuals are unaware of their condition due to the absence of symptoms in the early stages. It has a significant impact on patients' quality of life, particularly when it progresses to the need for dialysis. Early prediction of dialysis is crucial as it can significantly improve patient outcomes and assist healthcare providers in making timely and informed decisions. However, developing an effective machine learning (ML)-based Clinical Decision Support System (CDSS) for early dialysis prediction poses a key challenge due to the imbalanced nature of data. To address this challenge, this study evaluates various data augmentation techniques to understand their effectiveness on real-world datasets. We propose a new approach named Binary Gaussian Copula Synthesis (BGCS). BGCS is tailored for binary medical datasets and excels in generating synthetic minority data that mirrors the distribution of the original data. BGCS enhances early dialysis prediction by outperforming traditional methods in detecting dialysis patients. For the best ML model, Random Forest, BCGS achieved a 72% improvement, surpassing the state-of-the-art augmentation approaches. Also, we present a ML-based CDSS, designed to aid clinicians in making informed decisions. CDSS, which utilizes decision tree models, is developed to improve patient outcomes, identify critical variables, and thereby enable clinicians to make proactive decisions, and strategize treatment plans effectively for CKD patients who are more likely to require dialysis in the near future. Through comprehensive feature analysis and meticulous data preparation, we ensure that the CDSS's dialysis predictions are not only accurate but also actionable, providing a valuable tool in the management and treatment of CKD.
- [504] arXiv:2403.00975 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Equipment Health Assessment: Time Series Analysis for Wind Turbine PerformanceJana Backhus , Aniruddha Rajendra Rao , Chandrasekar Venkatraman , Abhishek Padmanabhan , A.Vinoth Kumar , Chetan GuptaComments: 19 Pages, 17 Figures, 3 Tables, Submitted at Applied Sciences (MDPI)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Functional Analysis (math.FA); Applications (stat.AP)
Abstract: In this study, we leverage SCADA data from diverse wind turbines to predict power output, employing advanced time series methods, specifically Functional Neural Networks (FNN) and Long Short-Term Memory (LSTM) networks. A key innovation lies in the ensemble of FNN and LSTM models, capitalizing on their collective learning. This ensemble approach outperforms individual models, ensuring stable and accurate power output predictions. Additionally, machine learning techniques are applied to detect wind turbine performance deterioration, enabling proactive maintenance strategies and health assessment. Crucially, our analysis reveals the uniqueness of each wind turbine, necessitating tailored models for optimal predictions. These insight underscores the importance of providing automatized customization for different turbines to keep human modeling effort low. Importantly, the methodologies developed in this analysis are not limited to wind turbines; they can be extended to predict and optimize performance in various machinery, highlighting the versatility and applicability of our research across diverse industrial contexts.
- [505] arXiv:2403.00986 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Merging Text Transformer Models from Different InitializationsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent work on one-shot permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations. However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain. Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging for several models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
- [506] arXiv:2403.00993 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: On the Role of Information Structure in Reinforcement Learning for Partially-Observable Sequential Teams and GamesComments: 57 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: In a sequential decision-making problem, the information structure is the description of how events in the system occurring at different points in time affect each other. Classical models of reinforcement learning (e.g., MDPs, POMDPs, Dec-POMDPs, and POMGs) assume a very simple and highly regular information structure, while more general models like predictive state representations do not explicitly model the information structure. By contrast, real-world sequential decision-making problems typically involve a complex and time-varying interdependence of system variables, requiring a rich and flexible representation of information structure.
In this paper, we argue for the perspective that explicit representation of information structures is an important component of analyzing and solving reinforcement learning problems. We propose novel reinforcement learning models with an explicit representation of information structure, capturing classical models as special cases. We show that this leads to a richer analysis of sequential decision-making problems and enables more tailored algorithm design. In particular, we characterize the "complexity" of the observable dynamics of any sequential decision-making problem through a graph-theoretic analysis of the DAG representation of its information structure. The central quantity in this analysis is the minimal set of variables that $d$-separates the past observations from future observations. Furthermore, through constructing a generalization of predictive state representations, we propose tailored reinforcement learning algorithms and prove that the sample complexity is in part determined by the information structure. This recovers known tractability results and gives a novel perspective on reinforcement learning in general sequential decision-making problems, providing a systematic way of identifying new tractable classes of problems. - [507] arXiv:2403.00994 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: Leveraging Prompt-Based Large Language Models: Predicting Pandemic Health Decisions and Outcomes Through Social Media LanguageComments: 20 pages, 4 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Abstract: We introduce a multi-step reasoning framework using prompt-based LLMs to examine the relationship between social media language patterns and trends in national health outcomes. Grounded in fuzzy-trace theory, which emphasizes the importance of gists of causal coherence in effective health communication, we introduce Role-Based Incremental Coaching (RBIC), a prompt-based LLM framework, to identify gists at-scale. Using RBIC, we systematically extract gists from subreddit discussions opposing COVID-19 health measures (Study 1). We then track how these gists evolve across key events (Study 2) and assess their influence on online engagement (Study 3). Finally, we investigate how the volume of gists is associated with national health trends like vaccine uptake and hospitalizations (Study 4). Our work is the first to empirically link social media linguistic patterns to real-world public health trends, highlighting the potential of prompt-based LLMs in identifying critical online discussion patterns that can form the basis of public health communication strategies.
- [508] arXiv:2403.01002 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Attribute Structuring Improves LLM-Based Evaluation of Clinical Text SummariesZelalem Gero , Chandan Singh , Yiqing Xie , Sheng Zhang , Tristan Naumann , Jianfeng Gao , Hoifung PoonComments: 4 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Summarizing clinical text is crucial in health decision-support and clinical research. Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation, especially in safety-critical domains such as health. Holistically evaluating text summaries is challenging because they may contain unsubstantiated information. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. It decomposes the evaluation process into a grounded procedure that uses an LLM for relatively simple structuring and scoring tasks, rather than the full task of holistic summary evaluation. Experiments show that AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization. Additionally, AS yields interpretations in the form of a short text span corresponding to each output, which enables efficient human auditing, paving the way towards trustworthy evaluation of clinical information in resource-constrained scenarios. We release our code, prompts, and an open-source benchmark at this https URL .
- [509] arXiv:2403.01003 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: FlaKat: A Machine Learning-Based Categorization Framework for Flaky TestsSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Flaky tests can pass or fail non-deterministically, without alterations to a software system. Such tests are frequently encountered by developers and hinder the credibility of test suites. State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy. Moreover, the majority of automated flaky test repair solutions are designed for specific types of flaky tests. This research work proposes a novel categorization framework, called FlaKat, which uses machine-learning classifiers for fast and accurate prediction of the category of a given flaky test that reflects its root cause. Sampling techniques are applied to address the imbalance between flaky test categories in the International Dataset of Flaky Test (IDoFT). A new evaluation metric, called Flakiness Detection Capacity (FDC), is proposed for measuring the accuracy of classifiers from the perspective of information theory and provides proof for its effectiveness. The final FDC results are also in agreement with F1 score regarding which classifier yields the best flakiness classification.
- [510] arXiv:2403.01005 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Policy Optimization for PDE Control with a Warm StartSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Dimensionality reduction is crucial for controlling nonlinear partial differential equations (PDE) through a "reduce-then-design" strategy, which identifies a reduced-order model and then implements model-based control solutions. However, inaccuracies in the reduced-order modeling can substantially degrade controller performance, especially in PDEs with chaotic behavior. To address this issue, we augment the reduce-then-design procedure with a policy optimization (PO) step. The PO step fine-tunes the model-based controller to compensate for the modeling error from dimensionality reduction. This augmentation shifts the overall strategy into reduce-then-design-then-adapt, where the model-based controller serves as a warm start for PO. Specifically, we study the state-feedback tracking control of PDEs that aims to align the PDE state with a specific constant target subject to a linear-quadratic cost. Through extensive experiments, we show that a few iterations of PO can significantly improve the model-based controller performance. Our approach offers a cost-effective alternative to PDE control using end-to-end reinforcement learning.
- [511] arXiv:2403.01024 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Reservoir Computing Using Measurement-Controlled Quantum DynamicsSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Abstract: Physical reservoir computing (RC) is a machine learning algorithm that employs the dynamics of a physical system to forecast highly nonlinear and chaotic phenomena. In this paper, we introduce a quantum RC system that employs the dynamics of a probed atom in a cavity. The atom experiences coherent driving at a particular rate, leading to a measurement-controlled quantum evolution. The proposed quantum reservoir can make fast and reliable forecasts using a small number of artificial neurons compared with the traditional RC algorithm. We theoretically validate the operation of the reservoir, demonstrating its potential to be used in error-tolerant applications, where approximate computing approaches may be used to make feasible forecasts in conditions of limited computational and energy resources.
- [512] arXiv:2403.01031 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Peacock: A Family of Arabic Multimodal Large Language Models and BenchmarksFakhraddin Alwajih , El Moatez Billah Nagoudi , Gagan Bhatia , Abdelrahman Mohamed , Muhammad Abdul-MageedComments: Under ReviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, including even those with large speaker populations such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed \textit{Peacock}, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce ~\textit{Henna}, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, setting the first stone for culturally-aware Arabic MLLMs.The GitHub repository for the \textit{Peacock} project is available at \url{ this https URL }.
- [513] arXiv:2403.01038 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacksJiacen Xu , Jack W. Stokes , Geoff McDonald , Xuesong Bai , David Marshall , Siyue Wang , Adith Swaminathan , Zhou LiSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have demonstrated impressive results on natural language tasks, and security researchers are beginning to employ them in both offensive and defensive systems. In cyber-security, there have been multiple research efforts that utilize LLMs focusing on the pre-breach stage of attacks like phishing and malware generation. However, so far there lacks a comprehensive study regarding whether LLM-based systems can be leveraged to simulate the post-breach stage of attacks that are typically human-operated, or "hands-on-keyboard" attacks, under various attack techniques and environments.
As LLMs inevitably advance, they may be able to automate both the pre- and post-breach attack stages. This shift may transform organizational attacks from rare, expert-led events to frequent, automated operations requiring no expertise and executed at automation speed and scale. This risks fundamentally changing global computer security and correspondingly causing substantial economic impacts, and a goal of this work is to better understand these risks now so we can better prepare for these inevitable ever-more-capable LLMs on the horizon. On the immediate impact side, this research serves three purposes. First, an automated LLM-based, post-breach exploitation framework can help analysts quickly test and continually improve their organization's network security posture against previously unseen attacks. Second, an LLM-based penetration test system can extend the effectiveness of red teams with a limited number of human analysts. Finally, this research can help defensive systems and teams learn to detect novel attack behaviors preemptively before their use in the wild.... - [514] arXiv:2403.01046 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection FeaturesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: We prove that training neural networks on 1-D data is equivalent to solving a convex Lasso problem with a fixed, explicitly defined dictionary matrix of features. The specific dictionary depends on the activation and depth. We consider 2-layer networks with piecewise linear activations, deep narrow ReLU networks with up to 4 layers, and rectangular and tree networks with sign activation and arbitrary depth. Interestingly in ReLU networks, a fourth layer creates features that represent reflections of training data about themselves. The Lasso representation sheds insight to globally optimal networks and the solution landscape.
- [515] arXiv:2403.01053 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic ModelingComments: CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven nature. With the ever-increasing stream of research data collection, it would be appealing to autonomously explore patterns and insights from observational data for discovering novel classes of phenotypes and concepts. However, in the biomedical domain, there are several challenges inherently presented in the cumulated data which hamper the progress of novel class discovery. The non-i.i.d. data distribution accompanied by the severe imbalance among different groups of classes essentially leads to ambiguous and biased semantic representations. In this work, we present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. First, we propose to parameterize the approximated posterior of instance embedding as a marginal von MisesFisher distribution to account for the interference of distributional latent bias. Then, we incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space, which in turn minimizes the uncontrollable risk for unknown class learning and structuring. Furthermore, a spectral graph-theoretic method is devised to estimate the number of potential novel classes. It inherits two intriguing merits compared to existent approaches, namely high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method.
- [516] arXiv:2403.01055 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Towards Full Authorship with AI: Supporting Revision with AI-Generated ViewsJiho Kim , Ray C. Flanagan , Noelle E. Haviland , ZeAi Sun , Souad N. Yakubu , Edom A. Maru , Kenneth C. ArnoldComments: 15 pages, 2 figures; Accepted to 5th Workshop on Human-AI Co-Creation with Generative Models (HAI-GEN) at ACM IUI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Large language models (LLMs) are shaping a new user interface (UI) paradigm in writing tools by enabling users to generate text through prompts. This paradigm shifts some creative control from the user to the system, thereby diminishing the user's authorship and autonomy in the writing process. To restore autonomy, we introduce Textfocals, a UI prototype designed to investigate a human-centered approach that emphasizes the user's role in writing. Textfocals supports the writing process by providing LLM-generated summaries, questions, and advice (i.e., LLM views) in a sidebar of a text editor, encouraging reflection and self-driven revision in writing without direct text generation. Textfocals' UI affordances, including contextually adaptive views and scaffolding for prompt selection and customization, offer a novel way to interact with LLMs where users maintain full authorship of their writing. A formative user study with Textfocals showed promising evidence that this approach might help users develop underdeveloped ideas, cater to the rhetorical audience, and clarify their writing. However, the study also showed interaction design challenges related to document navigation and scoping, prompt engineering, and context management. Our work highlights the breadth of the design space of writing support interfaces powered by generative AI that maintain authorship integrity.
- [517] arXiv:2403.01071 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GraphRCG: Self-conditioned Graph Generation via Bootstrapped RepresentationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph generation generally aims to create new graphs that closely align with a specific graph distribution. Existing works often implicitly capture this distribution through the optimization of generators, potentially overlooking the intricacies of the distribution itself. Furthermore, these approaches generally neglect the insights offered by the learned distribution for graph generation. In contrast, in this work, we propose a novel self-conditioned graph generation framework designed to explicitly model graph distributions and employ these distributions to guide the generation process. We first perform self-conditioned modeling to capture the graph distributions by transforming each graph sample into a low-dimensional representation and optimizing a representation generator to create new representations reflective of the learned distribution. Subsequently, we leverage these bootstrapped representations as self-conditioned guidance for the generation process, thereby facilitating the generation of graphs that more accurately reflect the learned distributions. We conduct extensive experiments on generic and molecular graph datasets across various fields. Our framework demonstrates superior performance over existing state-of-the-art graph generation methods in terms of graph quality and fidelity to training data.
- [518] arXiv:2403.01078 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: $\Gamma$-VAE: Curvature regularized variational autoencoders for uncovering emergent low dimensional geometric structure in high dimensional dataJason Z. Kim , Nicolas Perrin-Gilbert , Erkan Narmanli , Paul Klein , Christopher R. Myers , Itai Cohen , Joshua J. Waterfall , James P. SethnaComments: 8 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Genomics (q-bio.GN)
Abstract: Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, a process we coin ``$\Gamma$-VAE''. We demonstrate its utility using two example data sets: bulk RNA-seq from the The Cancer Genome Atlas (TCGA) and the Genotype Tissue Expression (GTEx); and single cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of distribution cancers as if they were originally trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells -- which have not yet specialized -- according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.
- [519] arXiv:2403.01079 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Teaching MLP More Graph Information: A Three-stage Multitask Knowledge Distillation FrameworkComments: 20 pages, with AppendixSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We study the challenging problem for inference tasks on large-scale graph datasets of Graph Neural Networks: huge time and memory consumption, and try to overcome it by reducing reliance on graph structure. Even though distilling graph knowledge to student MLP is an excellent idea, it faces two major problems of positional information loss and low generalization. To solve the problems, we propose a new three-stage multitask distillation framework. In detail, we use Positional Encoding to capture positional information. Also, we introduce Neural Heat Kernels responsible for graph data processing in GNN and utilize hidden layer outputs matching for better performance of student MLP's hidden layers. To the best of our knowledge, it is the first work to include hidden layer distillation for student MLP on graphs and to combine graph Positional Encoding with MLP. We test its performance and robustness with several settings and draw the conclusion that our work can outperform well with good stability.
- [520] arXiv:2403.01091 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: COOL: A Conjoint Perspective on Spatio-Temporal Graph Neural Network for Traffic ForecastingWei Ju , Yusheng Zhao , Yifang Qin , Siyu Yi , Jingyang Yuan , Zhiping Xiao , Xiao Luo , Xiting Yan , Ming ZhangComments: Accepted by Information Fusion 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Abstract: This paper investigates traffic forecasting, which attempts to forecast the future state of traffic based on historical situations. This problem has received ever-increasing attention in various scenarios and facilitated the development of numerous downstream applications such as urban planning and transportation management. However, the efficacy of existing methods remains sub-optimal due to their tendency to model temporal and spatial relationships independently, thereby inadequately accounting for complex high-order interactions of both worlds. Moreover, the diversity of transitional patterns in traffic forecasting makes them challenging to capture for existing approaches, warranting a deeper exploration of their diversity. Toward this end, this paper proposes Conjoint Spatio-Temporal graph neural network (abbreviated as COOL), which models heterogeneous graphs from prior and posterior information to conjointly capture high-order spatio-temporal relationships. On the one hand, heterogeneous graphs connecting sequential observation are constructed to extract composite spatio-temporal relationships via prior message passing. On the other hand, we model dynamic relationships using constructed affinity and penalty graphs, which guide posterior message passing to incorporate complementary semantic information into node representations. Moreover, to capture diverse transitional properties to enhance traffic forecasting, we propose a conjoint self-attention decoder that models diverse temporal patterns from both multi-rank and multi-scale views. Experimental results on four popular benchmark datasets demonstrate that our proposed COOL provides state-of-the-art performance compared with the competitive baselines.
- [521] arXiv:2403.01101 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Feature Alignment: Rethinking Efficient Active Learning via Proxy in the Context of Pre-trained ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Fine-tuning the pre-trained model with active learning holds promise for reducing annotation costs. However, this combination introduces significant computational costs, particularly with the growing scale of pre-trained models. Recent research has proposed proxy-based active learning, which pre-computes features to reduce computational costs. Yet, this approach often incurs a significant loss in active learning performance, which may even outweigh the computational cost savings. In this paper, we argue the performance drop stems not only from pre-computed features' inability to distinguish between categories of labeled samples, resulting in the selection of redundant samples but also from the tendency to compromise valuable pre-trained information when fine-tuning with samples selected through the proxy model. To address this issue, we propose a novel method called aligned selection via proxy to update pre-computed features while selecting a proper training method to inherit valuable pre-training information. Extensive experiments validate that our method significantly improves the total cost of efficient active learning while maintaining computational efficiency.
- [522] arXiv:2403.01106 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Distilling Text Style Transfer With Self-Explanation From LLMsComments: Accepted by NAACL Student Research Workshop 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of working with both non-parallel and parallel data. Through experimentation across four TST datasets, CoTeX is shown to surpass traditional supervised fine-tuning and knowledge distillation methods, particularly in low-resource settings. We conduct a comprehensive evaluation, comparing CoTeX against current unsupervised, supervised, in-context learning (ICL) techniques, and instruction-tuned LLMs. Furthermore, CoTeX distinguishes itself by offering transparent explanations for its style transfer process.
- [523] arXiv:2403.01118 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Adversarial Testing for Visual Grounding via Image-Aware Property ReductionComments: 14pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Due to the advantages of fusing information from various modalities, multimodal learning is gaining increasing attention. Being a fundamental task of multimodal learning, Visual Grounding (VG), aims to locate objects in images through natural language expressions. Ensuring the quality of VG models presents significant challenges due to the complex nature of the task. In the black box scenario, existing adversarial testing techniques often fail to fully exploit the potential of both modalities of information. They typically apply perturbations based solely on either the image or text information, disregarding the crucial correlation between the two modalities, which would lead to failures in test oracles or an inability to effectively challenge VG models. To this end, we propose PEELING, a text perturbation approach via image-aware property reduction for adversarial testing of the VG model. The core idea is to reduce the property-related information in the original expression meanwhile ensuring the reduced expression can still uniquely describe the original object in the image. To achieve this, PEELING first conducts the object and properties extraction and recombination to generate candidate property reduction expressions. It then selects the satisfied expressions that accurately describe the original object while ensuring no other objects in the image fulfill the expression, through querying the image with a visual understanding technique. We evaluate PEELING on the state-of-the-art VG model, i.e. OFA-VG, involving three commonly used datasets. Results show that the adversarial tests generated by PEELING achieves 21.4% in MultiModal Impact score (MMI), and outperforms state-of-the-art baselines for images and texts by 8.2%--15.1%.
- [524] arXiv:2403.01121 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: OpenGraph: Towards Open Graph Foundation ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: Graph learning has become indispensable for interpreting and harnessing relational data in diverse fields, ranging from recommendation systems to social network analysis. In this context, a variety of GNNs have emerged as promising methodologies for encoding the structural information of graphs. By effectively capturing the graph's underlying structure, these GNNs have shown great potential in enhancing performance in graph learning tasks, such as link prediction and node classification. However, despite their successes, a significant challenge persists: these advanced methods often face difficulties in generalizing to unseen graph data that significantly differs from the training instances. In this work, our aim is to advance the graph learning paradigm by developing a general graph foundation model. This model is designed to understand the complex topological patterns present in diverse graph data, enabling it to excel in zero-shot graph learning tasks across different downstream datasets. To achieve this goal, we address several key technical challenges in our OpenGraph model. Firstly, we propose a unified graph tokenizer to adapt our graph model to generalize well on unseen graph data, even when the underlying graph properties differ significantly from those encountered during training. Secondly, we develop a scalable graph transformer as the foundational encoder, which effectively captures node-wise dependencies within the global topological context. Thirdly, we introduce a data augmentation mechanism enhanced by a LLM to alleviate the limitations of data scarcity in real-world scenarios. Extensive experiments validate the effectiveness of our framework. By adapting our OpenGraph to new graph characteristics and comprehending the nuances of diverse graphs, our approach achieves remarkable zero-shot graph learning performance across various settings and domains.
- [525] arXiv:2403.01131 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: LLaMoCo: Instruction Tuning of Large Language Models for Optimization Code GenerationSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE)
Abstract: Recent research explores optimization using large language models (LLMs) by either iteratively seeking next-step solutions from LLMs or directly prompting LLMs for an optimizer. However, these approaches exhibit inherent limitations, including low operational efficiency, high sensitivity to prompt design, and a lack of domain-specific knowledge. We introduce LLaMoCo, the first instruction-tuning framework designed to adapt LLMs for solving optimization problems in a code-to-code manner. Specifically, we establish a comprehensive instruction set containing well-described problem prompts and effective optimization codes. We then develop a novel two-phase learning strategy that incorporates a contrastive learning-based warm-up procedure before the instruction-tuning phase to enhance the convergence behavior during model fine-tuning. The experiment results demonstrate that a CodeGen (350M) model fine-tuned by our LLaMoCo achieves superior optimization performance compared to GPT-4 Turbo and the other competitors across both synthetic and realistic problem sets. The fine-tuned model and the usage instructions are available at https://anonymous.4open.science/r/LLaMoCo-722A.
- [526] arXiv:2403.01136 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive QuantizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are largely served using uniform high-caliber GPUs nowadays, utilizing a heterogeneous cluster with a mix of available high- and low-capacity GPUs can potentially substantially reduce the serving cost. There is a lack of designs to support efficient LLM serving using a heterogeneous cluster, while the current solutions focus on model partition and uniform compression among homogeneous devices. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. We carefully decide on mixed-precision model quantization together with phase-aware model partition and micro-batch sizing in distributed LLM serving with an efficient algorithm, to greatly enhance inference throughput while fulfilling user-specified model quality targets. Extensive experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) throughput improvement in inference, showing great advantages over state-of-the-art works.
- [527] arXiv:2403.01139 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ParallelPARC: A Scalable Pipeline for Generating Natural-Language AnalogiesComments: NAACL 2024 mainSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Analogy-making is central to human cognition, allowing us to adapt to novel situations -- an ability that current AI systems still lack. Most analogy datasets today focus on simple analogies (e.g., word analogies); datasets including complex types of analogies are typically manually curated and very small. We believe that this holds back progress in computational analogy. In this work, we design a data generation pipeline, ParallelPARC (Parallel Paragraph Creator) leveraging state-of-the-art Large Language Models (LLMs) to create complex, paragraph-based analogies, as well as distractors, both simple and challenging. We demonstrate our pipeline and create ProPara-Logy, a dataset of analogies between scientific processes. We publish a gold-set, validated by humans, and a silver-set, generated automatically. We test LLMs' and humans' analogy recognition in binary and multiple-choice settings, and found that humans outperform the best models (~13% gap) after a light supervision. We demonstrate that our silver-set is useful for training models. Lastly, we show challenging distractors confuse LLMs, but not humans. We hope our pipeline will encourage research in this emerging field.
- [528] arXiv:2403.01147 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Hybrid Model for Traffic Incident Detection based on Generative Adversarial Networks and Transformer ModelComments: 19 pages, 8 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In addition to enhancing traffic safety and facilitating prompt emergency response, traffic incident detection plays an indispensable role in intelligent transportation systems by providing real-time traffic status information. This enables the realization of intelligent traffic control and management. Previous research has identified that apart from employing advanced algorithmic models, the effectiveness of detection is also significantly influenced by challenges related to acquiring large datasets and addressing dataset imbalances. A hybrid model combining transformer and generative adversarial networks (GANs) is proposed to address these challenges. Experiments are conducted on four real datasets to validate the superiority of the transformer in traffic incident detection. Additionally, GANs are utilized to expand the dataset and achieve a balanced ratio of 1:4, 2:3, and 1:1. The proposed model is evaluated against the baseline model. The results demonstrate that the proposed model enhances the dataset size, balances the dataset, and improves the performance of traffic incident detection in various aspects.
- [529] arXiv:2403.01152 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Survey of AI-generated Text Forensic Systems: Detection, Attribution, and CharacterizationTharindu Kumarage , Garima Agrawal , Paras Sheth , Raha Moraffah , Aman Chadha , Joshua Garland , Huan LiuSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We have witnessed lately a rapid proliferation of advanced Large Language Models (LLMs) capable of generating high-quality text. While these LLMs have revolutionized text generation across various domains, they also pose significant risks to the information ecosystem, such as the potential for generating convincing propaganda, misinformation, and disinformation at scale. This paper offers a review of AI-generated text forensic systems, an emerging field addressing the challenges of LLM misuses. We present an overview of the existing efforts in AI-generated text forensics by introducing a detailed taxonomy, focusing on three primary pillars: detection, attribution, and characterization. These pillars enable a practical understanding of AI-generated text, from identifying AI-generated content (detection), determining the specific AI model involved (attribution), and grouping the underlying intents of the text (characterization). Furthermore, we explore available resources for AI-generated text forensics research and discuss the evolving challenges and future directions of forensic systems in an AI era.
- [530] arXiv:2403.01165 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: STAR: Constraint LoRA with Dynamic Active Learning for Data-Efficient Fine-Tuning of Large Language ModelsComments: Our code and results will be available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Though Large Language Models (LLMs) have demonstrated the powerful capabilities of few-shot learning through prompting methods, supervised training is still necessary for complex reasoning tasks. Because of their extensive parameters and memory consumption, both Parameter-Efficient Fine-Tuning (PEFT) methods and Memory-Efficient Fine-Tuning methods have been proposed for LLMs. Nevertheless, the issue of large annotated data consumption, the aim of Data-Efficient Fine-Tuning, remains unexplored. One obvious way is to combine the PEFT method with active learning. However, the experimental results show that such a combination is not trivial and yields inferior results. Through probe experiments, such observation might be explained by two main reasons: uncertainty gap and poor model calibration. Therefore, in this paper, we propose a novel approach to effectively integrate uncertainty-based active learning and LoRA. Specifically, for the uncertainty gap, we introduce a dynamic uncertainty measurement that combines the uncertainty of the base model and the uncertainty of the full model during the iteration of active learning. For poor model calibration, we incorporate the regularization method during LoRA training to keep the model from being over-confident, and the Monte-Carlo dropout mechanism is employed to enhance the uncertainty estimation. Experimental results show that the proposed approach outperforms existing baseline models on three complex reasoning tasks.
- [531] arXiv:2403.01166 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DINER: Debiasing Aspect-based Sentiment Analysis with Multi-variable Causal InferenceComments: Our code and results will be available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Though notable progress has been made, neural-based aspect-based sentiment analysis (ABSA) models are prone to learn spurious correlations from annotation biases, resulting in poor robustness on adversarial data transformations. Among the debiasing solutions, causal inference-based methods have attracted much research attention, which can be mainly categorized into causal intervention methods and counterfactual reasoning methods. However, most of the present debiasing methods focus on single-variable causal inference, which is not suitable for ABSA with two input variables (the target aspect and the review). In this paper, we propose a novel framework based on multi-variable causal inference for debiasing ABSA. In this framework, different types of biases are tackled based on different causal intervention methods. For the review branch, the bias is modeled as indirect confounding from context, where backdoor adjustment intervention is employed for debiasing. For the aspect branch, the bias is described as a direct correlation with labels, where counterfactual reasoning is adopted for debiasing. Extensive experiments demonstrate the effectiveness of the proposed method compared to various baselines on the two widely used real-world aspect robustness test set datasets.
- [532] arXiv:2403.01183 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Leveraging Self-Supervised Learning for Scene Recognition in Child Sexual Abuse ImageryComments: 13 pages, 5 figures, 4 tables. Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Crime in the 21st century is split into a virtual and real world. However, the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing nature of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originated from online sources. Therefore, investigation centers and clearinghouses cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene recognition task looks for contextual cues in the environment, being able to group and classify child sexual abuse data without requiring to be trained on sensitive material. The scarcity and limitations of working with child sexual abuse images lead to self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to target tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task and, on average, 2.2 percentage points better performance than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted on sensitive materials.
- [533] arXiv:2403.01185 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Balancing Exploration and Exploitation in LLM using Soft RLLF for Enhanced Negation UnderstandingComments: JURISIN 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Finetuning approaches in NLP often focus on exploitation rather than exploration, which may lead to suboptimal models. Given the vast search space of natural language, this limited exploration can restrict their performance in complex, high-stakes domains, where accurate negation understanding and logical reasoning abilities are crucial. To address this issue, we leverage Reinforcement Learning from Logical Feedback (RLLF) to create an effective balance between exploration and exploitation in LLMs. Our approach employs an appropriate benchmark dataset for training and evaluation, highlighting the importance of exploration in enhancing negation understanding capabilities. We compare the performance of our RLLF-enhanced LLMs with baseline models trained without RLLF, demonstrating the value of this balanced approach. Furthermore, we showcase the potential of our method in legal AI applications by employing transfer learning and evaluating its impact on negation understanding. Our experimental results exhibit the effectiveness of balancing exploration and exploitation with RLLF in improving LLMs' negation capabilities. This has implications for the development of more accurate, reliable, and logically consistent language models in high-stakes domains.
- [534] arXiv:2403.01193 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RAGged Edges: The Double-Edged Sword of Retrieval-Augmented ChatbotsComments: 7 Pages, 1 Figure, 1 TableSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) like ChatGPT demonstrate the remarkable progress of artificial intelligence. However, their tendency to hallucinate -- generate plausible but false information -- poses a significant challenge. This issue is critical, as seen in recent court cases where ChatGPT's use led to citations of non-existent legal rulings. This paper explores how Retrieval-Augmented Generation (RAG) can counter hallucinations by integrating external knowledge with prompts. We empirically evaluate RAG against standard LLMs using prompts designed to induce hallucinations. Our results show that RAG increases accuracy in some cases, but can still be misled when prompts directly contradict the model's pre-trained understanding. These findings highlight the complex nature of hallucinations and the need for more robust solutions to ensure LLM reliability in real-world applications. We offer practical recommendations for RAG deployment and discuss implications for the development of more trustworthy LLMs.
- [535] arXiv:2403.01196 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021Journal-ref: Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid related data, from the Health and Education domains, was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.
- [536] arXiv:2403.01210 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SAR-AE-SFP: SAR Imagery Adversarial Example in Real Physics domain with Target Scattering Feature ParametersComments: 10 pages, 9 figures, 2 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deep neural network-based Synthetic Aperture Radar (SAR) target recognition models are susceptible to adversarial examples. Current adversarial example generation methods for SAR imagery primarily operate in the 2D digital domain, known as image adversarial examples. Recent work, while considering SAR imaging scatter mechanisms, fails to account for the actual imaging process, rendering attacks in the three-dimensional physical domain infeasible, termed pseudo physics adversarial examples. To address these challenges, this paper proposes SAR-AE-SFP-Attack, a method to generate real physics adversarial examples by altering the scattering feature parameters of target objects. Specifically, we iteratively optimize the coherent energy accumulation of the target echo by perturbing the reflection coefficient and scattering coefficient in the scattering feature parameters of the three-dimensional target object, and obtain the adversarial example after echo signal processing and imaging processing in the RaySAR simulator. Experimental results show that compared to digital adversarial attack methods, SAR-AE-SFP Attack significantly improves attack efficiency on CNN-based models (over 30\%) and Transformer-based models (over 13\%), demonstrating significant transferability of attack effects across different models and perspectives.
- [537] arXiv:2403.01216 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: API Is Enough: Conformal Prediction for Large Language Models Without Logit-AccessSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.
- [538] arXiv:2403.01221 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Two-Stage Algorithm for Cost-Efficient Multi-instance Counterfactual ExplanationsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Counterfactual explanations constitute among the most popular methods for analyzing the predictions of black-box systems since they can recommend cost-efficient and actionable changes to the input to turn an undesired system's output into a desired output. While most of the existing counterfactual methods explain a single instance, several real-world use cases, such as customer satisfaction, require the identification of a single counterfactual that can satisfy multiple instances (e.g. customers) simultaneously. In this work, we propose a flexible two-stage algorithm for finding groups of instances along with cost-efficient multi-instance counterfactual explanations. This is motivated by the fact that in most previous works the aspect of finding such groups is not addressed.
- [539] arXiv:2403.01229 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: REWIND Dataset: Privacy-preserving Speaking Status Segmentation from Multimodal Body Movement Signals in the WildSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Recognizing speaking in humans is a central task towards understanding social interactions. Ideally, speaking would be detected from individual voice recordings, as done previously for meeting scenarios. However, individual voice recordings are hard to obtain in the wild, especially in crowded mingling scenarios due to cost, logistics, and privacy concerns. As an alternative, machine learning models trained on video and wearable sensor data make it possible to recognize speech by detecting its related gestures in an unobtrusive, privacy-preserving way. These models themselves should ideally be trained using labels obtained from the speech signal. However, existing mingling datasets do not contain high quality audio recordings. Instead, speaking status annotations have often been inferred by human annotators from video, without validation of this approach against audio-based ground truth. In this paper we revisit no-audio speaking status estimation by presenting the first publicly available multimodal dataset with high-quality individual speech recordings of 33 subjects in a professional networking event. We present three baselines for no-audio speaking status segmentation: a) from video, b) from body acceleration (chest-worn accelerometer), c) from body pose tracks. In all cases we predict a 20Hz binary speaking status signal extracted from the audio, a time resolution not available in previous datasets. In addition to providing the signals and ground truth necessary to evaluate a wide range of speaking status detection methods, the availability of audio in REWIND makes it suitable for cross-modality studies not feasible with previous mingling datasets. Finally, our flexible data consent setup creates new challenges for multimodal systems under missing modalities.
- [540] arXiv:2403.01232 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Polynormer: Polynomial-Expressive Graph Transformer in Linear TimeComments: Published as a conference paper at International Conference on Learning Representations (ICLR) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph transformers (GTs) have emerged as a promising architecture that is theoretically more expressive than message-passing graph neural networks (GNNs). However, typical GT models have at least quadratic complexity and thus cannot scale to large graphs. While there are several linear GTs recently proposed, they still lag behind GNN counterparts on several popular graph datasets, which poses a critical concern on their practical expressivity. To balance the trade-off between expressivity and scalability of GTs, we propose Polynormer, a polynomial-expressive GT model with linear complexity. Polynormer is built upon a novel base model that learns a high-degree polynomial on input features. To enable the base model permutation equivariant, we integrate it with graph topology and node features separately, resulting in local and global equivariant attention models. Consequently, Polynormer adopts a linear local-to-global attention scheme to learn high-degree equivariant polynomials whose coefficients are controlled by attention scores. Polynormer has been evaluated on $13$ homophilic and heterophilic datasets, including large graphs with millions of nodes. Our extensive experiment results show that Polynormer outperforms state-of-the-art GNN and GT baselines on most datasets, even without the use of nonlinear activation functions.
- [541] arXiv:2403.01241 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens IntactRuikang Liu , Haoli Bai , Haokun Lin , Yuening Li , Han Gao , Zhengzhuo Xu , Lu Hou , Jun Yao , Chun YuanSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outlier in LLMs. Such outliers are found to allocate most of the attention scores on initial tokens of input, termed as pivot tokens, which is crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions. Besides, IntactKV can be calibrated as additional LLM parameters to boost the quantized LLMs further. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement and achieves lossless weight-only INT4 quantization on various downstream tasks, leading to the new state-of-the-art for LLM quantization.
- [542] arXiv:2403.01242 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Augmenting Automation: Intent-Based User Instruction Classification with Machine LearningComments: 7 pages, 14 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: Electric automation systems offer convenience and efficiency in controlling electrical circuits and devices. Traditionally, these systems rely on predefined commands for control, limiting flexibility and adaptability. In this paper, we propose a novel approach to augment automation by introducing intent-based user instruction classification using machine learning techniques. Our system represents user instructions as intents, allowing for dynamic control of electrical circuits without relying on predefined commands. Through a machine learning model trained on a labeled dataset of user instructions, our system classifies intents from user input, enabling a more intuitive and adaptable control scheme. We present the design and implementation of our intent-based electric automation system, detailing the development of the machine learning model for intent classification. Experimental results demonstrate the effectiveness of our approach in enhancing user experience and expanding the capabilities of electric automation systems. Our work contributes to the advancement of smart technologies by providing a more seamless interaction between users and their environments.
- [543] arXiv:2403.01244 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized RehearsalJianheng Huang , Leyang Cui , Ante Wang , Chengyi Yang , Xinting Liao , Linfeng Song , Junfeng Yao , Jinsong SuSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) suffer from catastrophic forgetting during continual learning. Conventional rehearsal-based methods rely on previous training data to retain the model's ability, which may not be feasible in real-world applications. When conducting continual learning based on a publicly-released LLM checkpoint, the availability of the original training data may be non-existent. To address this challenge, we propose a framework called Self-Synthesized Rehearsal (SSR) that uses the LLM to generate synthetic instances for rehearsal. Concretely, we first employ the base LLM for in-context learning to generate synthetic instances. Subsequently, we utilize the latest LLM to refine the instance outputs based on the synthetic inputs, preserving its acquired ability. Finally, we select diverse high-quality synthetic instances for rehearsal in future stages. Experimental results demonstrate that SSR achieves superior or comparable performance compared to conventional rehearsal-based approaches while being more data-efficient. Besides, SSR effectively preserves the generalization capabilities of LLMs in general domains.
- [544] arXiv:2403.01248 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender CodeZiniu Hu , Ahmet Iscen , Aashi Jain , Thomas Kipf , Yisong Yue , David A. Ross , Cordelia Schmid , Alireza FathiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.
- [545] arXiv:2403.01255 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Automatic Speech Recognition using Advanced Deep Learning Approaches: A surveyJournal-ref: Information Fusion, Elsevier, 2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Abstract: Recent advancements in deep learning (DL) have posed a significant challenge for automatic speech recognition (ASR). ASR relies on extensive training datasets, including confidential ones, and demands substantial computational and storage resources. Enabling adaptive systems improves ASR performance in dynamic environments. DL techniques assume training and testing data originate from the same domain, which is not always true. Advanced DL techniques like deep transfer learning (DTL), federated learning (FL), and reinforcement learning (RL) address these issues. DTL allows high-performance models using small yet related datasets, FL enables training on confidential data without dataset possession, and RL optimizes decision-making in dynamic environments, reducing computation costs. This survey offers a comprehensive review of DTL, FL, and RL-based ASR frameworks, aiming to provide insights into the latest developments and aid researchers and professionals in understanding the current challenges. Additionally, transformers, which are advanced DL techniques heavily used in proposed ASR frameworks, are considered in this survey for their ability to capture extensive dependencies in the input ASR sequence. The paper starts by presenting the background of DTL, FL, RL, and Transformers and then adopts a well-designed taxonomy to outline the state-of-the-art approaches. Subsequently, a critical analysis is conducted to identify the strengths and weaknesses of each framework. Additionally, a comparative study is presented to highlight the existing challenges, paving the way for future research opportunities.
- [546] arXiv:2403.01273 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free AttentionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language model inference on Central Processing Units (CPU) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at this https URL .
- [547] arXiv:2403.01277 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Optimal Integrated Task and Path Planning and Its Application to Multi-Robot Pickup and DeliverySubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: We propose a generic multi-robot planning mechanism that combines an optimal task planner and an optimal path planner to provide a scalable solution for complex multi-robot planning problems. The Integrated planner, through the interaction of the task planner and the path planner, produces optimal collision-free trajectories for the robots. We illustrate our general algorithm on an object pick-and-drop planning problem in a warehouse scenario where a group of robots is entrusted with moving objects from one location to another in the workspace. We solve the task planning problem by reducing it into an SMT-solving problem and employing the highly advanced SMT solver Z3 to solve it. To generate collision-free movement of the robots, we extend the state-of-the-art algorithm Conflict Based Search with Precedence Constraints with several domain-specific constraints. We evaluate our integrated task and path planner extensively on various instances of the object pick-and-drop planning problem and compare its performance with a state-of-the-art multi-robot classical planner. Experimental results demonstrate that our planning mechanism can deal with complex planning problems and outperforms a state-of-the-art classical planner both in terms of computation time and the quality of the generated plan.
- [548] arXiv:2403.01281 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Fast Low-parameter Video Activity Localization in Collaborative Learning EnvironmentsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Research on video activity detection has primarily focused on identifying well-defined human activities in short video segments. The majority of the research on video activity recognition is focused on the development of large parameter systems that require training on large video datasets. This paper develops a low-parameter, modular system with rapid inferencing capabilities that can be trained entirely on limited datasets without requiring transfer learning from large-parameter systems. The system can accurately detect and associate specific activities with the students who perform the activities in real-life classroom videos. Additionally, the paper develops an interactive web-based application to visualize human activity maps over long real-life classroom videos.
- [549] arXiv:2403.01286 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Summary Paper: Use Case on Building Collaborative Safe Autonomous Systems-A Robotdog for Guiding Visually Impaired PeopleSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract: This is a summary paper of a use case of a Robotdog dedicated to guide visually impaired people in complex environment like a smart intersection. In such scenarios, the Robotdog has to autonomously decide whether it is safe to cross the intersection or not in order to further guide the human. We leverage data sharing and collaboration between the Robotdog and other autonomous systems operating in the same environment. We propose a system architecture for autonomous systems through a separation of a collaborative decision layer, to enable collective decision making processes, where data about the environment, relevant to the Robotdog decision, together with evidences for trustworthiness about other systems and the environment are shared.
- [550] arXiv:2403.01308 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: VBART: The Turkish LLMSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART are compact LLMs based on good ideas leveraged from BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that having a pre-trained LLM for Turkish outperforms up to 3x multilingual models, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevancy of Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned vngrs-web-corpus of 135 GB are publicly available at this http URL .
- [551] arXiv:2403.01309 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: VNLP: Turkish NLP PackageSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this work, we present VNLP: the first dedicated, complete, open-source, well-documented, lightweight, production-ready, state-of-the-art Natural Language Processing (NLP) package for the Turkish language. It contains a wide variety of tools, ranging from the simplest tasks, such as sentence splitting and text normalization, to the more advanced ones, such as text and token classification models. Its token classification models are based on "Context Model", a novel architecture that is both an encoder and an auto-regressive model. NLP tasks solved by VNLP models include but are not limited to Sentiment Analysis, Named Entity Recognition, Morphological Analysis \& Disambiguation and Part-of-Speech Tagging. Moreover, it comes with pre-trained word embeddings and corresponding SentencePiece Unigram tokenizers. VNLP has an open-source GitHub repository, ReadtheDocs documentation, PyPi package for convenient installation, Python and command-line API and a demo page to test all the functionality. Consequently, our main contribution is a complete, compact, easy-to-install and easy-to-use NLP package for Turkish.
- [552] arXiv:2403.01329 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Bespoke Non-Stationary Solvers for Fast Sampling of Diffusion and Flow ModelsNeta Shaul , Uriel Singer , Ricky T. Q. Chen , Matthew Le , Ali Thabet , Albert Pumarola , Yaron LipmanSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper introduces Bespoke Non-Stationary (BNS) Solvers, a solver distillation approach to improve sample efficiency of Diffusion and Flow models. BNS solvers are based on a family of non-stationary solvers that provably subsumes existing numerical ODE solvers and consequently demonstrate considerable improvement in sample approximation (PSNR) over these baselines. Compared to model distillation, BNS solvers benefit from a tiny parameter space ($<$200 parameters), fast optimization (two orders of magnitude faster), maintain diversity of samples, and in contrast to previous solver distillation approaches nearly close the gap from standard distillation methods such as Progressive Distillation in the low-medium NFE regime. For example, BNS solver achieves 45 PSNR / 1.76 FID using 16 NFE in class-conditional ImageNet-64. We experimented with BNS solvers for conditional image generation, text-to-image generation, and text-2-audio generation showing significant improvement in sample approximation (PSNR) in all.
- [553] arXiv:2403.01332 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: Chaining thoughts and LLMs to learn DNA structural biophysicsSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The future development of an AI scientist, a tool that is capable of integrating a variety of experimental data and generating testable hypotheses, holds immense potential. So far, bespoke machine learning models have been created to specialize in singular scientific tasks, but otherwise lack the flexibility of a general purpose model. Here, we show that a general purpose large language model, chatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA. We find that both fine-tuning models to return chain-of-thought responses and chaining together models fine-tuned for subtasks have an enhanced ability to analyze and design DNA sequences and their structures.
- [554] arXiv:2403.01348 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: SANGRIA: Stacked Autoencoder Neural Networks with Gradient Boosting for Indoor LocalizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Indoor localization is a critical task in many embedded applications, such as asset tracking, emergency response, and realtime navigation. In this article, we propose a novel fingerprintingbased framework for indoor localization called SANGRIA that uses stacked autoencoder neural networks with gradient boosted trees. Our approach is designed to overcome the device heterogeneity challenge that can create uncertainty in wireless signal measurements across embedded devices used for localization. We compare SANGRIA to several state-of-the-art frameworks and demonstrate 42.96% lower average localization error across diverse indoor locales and heterogeneous devices.
- [555] arXiv:2403.01369 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech EnhancementComments: 8 pages; Shorter form accepted in ICASSP 2024Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Self-supervised learned models have been found to be very effective for certain speech tasks such as automatic speech recognition, speaker identification, keyword spotting and others. While the features are undeniably useful in speech recognition and associated tasks, their utility in speech enhancement systems is yet to be firmly established, and perhaps not properly understood. In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value for the enhancement task. Our constraints are designed around on-device real-time speech enhancement -- model is causal, the compute footprint is small. Additionally, we focus on low SNR conditions where such models struggle to provide good enhancement. In order to systematically examine how SSL representations impact performance of such enhancement models, we propose a variety of techniques to utilize these embeddings which include different forms of knowledge-distillation and pre-training.
- [556] arXiv:2403.01384 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Compressibility of Quantized Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirement of LLMs. Quantization is an effective way of reducing the model size while maintaining good performance. However, even after quantization, LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference. In this case, the I/O latency of model loading becomes the bottleneck of the LLM inference latency. In this work, we take a preliminary step of studying applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices. In particular, we discussed the compressibility of quantized LLMs, the trade-off between the compressibility and performance of quantized LLMs, and opportunities to optimize both of them jointly.
- [557] arXiv:2403.01400 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training TasksComments: Published as a conference paper at ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Recent years have witnessed the great success of graph pre-training for graph representation learning. With hundreds of graph pre-training tasks proposed, integrating knowledge acquired from multiple pre-training tasks has become a popular research topic. In this paper, we identify two important collaborative processes for this topic: (1) select: how to select an optimal task combination from a given task pool based on their compatibility, and (2) weigh: how to weigh the selected tasks based on their importance. While there currently has been a lot of work focused on weighing, comparatively little effort has been devoted to selecting. This paper proposes a novel instance-level framework for integrating multiple graph pre-training tasks, Weigh And Select (WAS), where the two collaborative processes, weighing and selecting, are combined by decoupled siamese networks. Specifically, it first adaptively learns an optimal combination of tasks for each instance from a given task pool, based on which a customized instance-level task weighing strategy is learned. Extensive experiments on 16 graph datasets across node-level and graph-level downstream tasks have demonstrated that by combining a few simple but classical tasks, WAS can achieve comparable performance to other leading counterparts. The code is available at this https URL .
- [558] arXiv:2403.01407 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Region-Transformer: Self-Attention Region Based Class-Agnostic Point Cloud SegmentationComments: 8 pages, 5 figures, 3 tablesJournal-ref: 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4 VISAPP: VISAPP, 341-348, 2024 , Rome, ItalySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Point cloud segmentation, which helps us understand the environment of specific structures and objects, can be performed in class-specific and class-agnostic ways. We propose a novel region-based transformer model called Region-Transformer for performing class-agnostic point cloud segmentation. The model utilizes a region-growth approach and self-attention mechanism to iteratively expand or contract a region by adding or removing points. It is trained on simulated point clouds with instance labels only, avoiding semantic labels. Attention-based networks have succeeded in many previous methods of performing point cloud segmentation. However, a region-growth approach with attention-based networks has yet to be used to explore its performance gain. To our knowledge, we are the first to use a self-attention mechanism in a region-growth approach. With the introduction of self-attention to region-growth that can utilize local contextual information of neighborhood points, our experiments demonstrate that the Region-Transformer model outperforms previous class-agnostic and class-specific methods on indoor datasets regarding clustering metrics. The model generalizes well to large-scale scenes. Key advantages include capturing long-range dependencies through self-attention, avoiding the need for semantic labels during training, and applicability to a variable number of objects. The Region-Transformer model represents a promising approach for flexible point cloud segmentation with applications in robotics, digital twinning, and autonomous vehicles.
- [559] arXiv:2403.01413 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Exploring the Design of Generative AI in Supporting Music-based Reminiscence for Older AdultsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Music-based reminiscence has the potential to positively impact the psychological well-being of older adults. However, the aging process and physiological changes, such as memory decline and limited verbal communication, may impede the ability of older adults to recall their memories and life experiences. Given the advanced capabilities of generative artificial intelligence (AI) systems, such as generated conversations and images, and their potential to facilitate the reminiscing process, this study aims to explore the design of generative AI to support music-based reminiscence in older adults. This study follows a user-centered design approach incorporating various stages, including detailed interviews with two social workers and two design workshops (involving ten older adults). Our work contributes to an in-depth understanding of older adults' attitudes toward utilizing generative AI for supporting music-based reminiscence and identifies concrete design considerations for the future design of generative AI to enhance the reminiscence experience of older adults.
- [560] arXiv:2403.01437 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity FeaturesComments: 5 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Moment retrieval (MR) and highlight detection (HD) aim to identify relevant moments and highlights in video from corresponding natural language query. Large language models (LLMs) have demonstrated proficiency in various computer vision tasks. However, existing methods for MR\&HD have not yet been integrated with LLMs. In this letter, we propose a novel two-stage model that takes the output of LLMs as the input to the second-stage transformer encoder-decoder. First, MiniGPT-4 is employed to generate the detailed description of the video frame and rewrite the query statement, fed into the encoder as new features. Then, semantic similarity is computed between the generated description and the rewritten queries. Finally, continuous high-similarity video frames are converted into span anchors, serving as prior position information for the decoder. Experiments demonstrate that our approach achieves a state-of-the-art result, and by using only span anchors and similarity scores as outputs, positioning accuracy outperforms traditional methods, like Moment-DETR.
- [561] arXiv:2403.01456 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT AssessmentSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Item difficulty plays a crucial role in adaptive testing. However, few works have focused on generating questions of varying difficulty levels, especially for multiple-choice (MC) cloze tests. We propose training pre-trained language models (PLMs) as surrogate models to enable item response theory (IRT) assessment, avoiding the need for human test subjects. We also propose two strategies to control the difficulty levels of both the gaps and the distractors using ranking rules to reduce invalid distractors. Experimentation on a benchmark dataset demonstrates that our proposed framework and methods can effectively control and evaluate the difficulty levels of MC cloze tests.
- [562] arXiv:2403.01467 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Collaborate to Adapt: Source-Free Graph Domain Adaptation via Bi-directional AdaptationComments: Accepted by WWW-2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Unsupervised Graph Domain Adaptation (UGDA) has emerged as a practical solution to transfer knowledge from a label-rich source graph to a completely unlabelled target graph. However, most methods require a labelled source graph to provide supervision signals, which might not be accessible in the real-world settings due to regulations and privacy concerns. In this paper, we explore the scenario of source-free unsupervised graph domain adaptation, which tries to address the domain adaptation problem without accessing the labelled source graph. Specifically, we present a novel paradigm called GraphCTA, which performs model adaptation and graph adaptation collaboratively through a series of procedures: (1) conduct model adaptation based on node's neighborhood predictions in target graph considering both local and global information; (2) perform graph adaptation by updating graph structure and node attributes via neighborhood contrastive learning; and (3) the updated graph serves as an input to facilitate the subsequent iteration of model adaptation, thereby establishing a collaborative loop between model adaptation and graph adaptation. Comprehensive experiments are conducted on various public datasets. The experimental results demonstrate that our proposed model outperforms recent source-free baselines by large margins.
- [563] arXiv:2403.01475 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Representation Learning on Heterophilic Graph with Directional Neighborhood AttentionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: Graph Attention Network (GAT) is one of the most popular Graph Neural Network (GNN) architecture, which employs the attention mechanism to learn edge weights and has demonstrated promising performance in various applications. However, since it only incorporates information from immediate neighborhood, it lacks the ability to capture long-range and global graph information, leading to unsatisfactory performance on some datasets, particularly on heterophilic graphs. To address this limitation, we propose the Directional Graph Attention Network (DGAT) in this paper. DGAT is able to combine the feature-based attention with the global directional information extracted from the graph topology. To this end, a new class of Laplacian matrices is proposed which can provably reduce the diffusion distance between nodes. Based on the new Laplacian, topology-guided neighbour pruning and edge adding mechanisms are proposed to remove the noisy and capture the helpful long-range neighborhood information. Besides, a global directional attention is designed to enable a topological-aware information propagation. The superiority of the proposed DGAT over the baseline GAT has also been verified through experiments on real-world benchmarks and synthetic data sets. It also outperforms the state-of-the-art (SOTA) models on 6 out of 7 real-world benchmark datasets.
- [564] arXiv:2403.01479 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine TranslationComments: Accepted to LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches to Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb and WMT-2014 En->De, respectively, compared to Transformer baselines.
- [565] arXiv:2403.01489 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image generative models have recently garnered significant attention due to their ability to generate images based on prompt descriptions. While these models have shown promising performance, concerns have been raised regarding the potential misuse of the generated fake images. In response to this, we have presented a simple yet effective training-free method to attribute fake images generated by text-to-image models to their source models. Given a test image to be attributed, we first inverse the textual prompt of the image, and then put the reconstructed prompt into different candidate models to regenerate candidate fake images. By calculating and ranking the similarity of the test image and the candidate images, we can determine the source of the image. This attribution allows model owners to be held accountable for any misuse of their models. Note that our approach does not limit the number of candidate text-to-image generative models. Comprehensive experiments reveal that (1) Our method can effectively attribute fake images to their source models, achieving comparable attribution performance with the state-of-the-art method; (2) Our method has high scalability ability, which is well adapted to real-world attribution scenarios. (3) The proposed method yields satisfactory robustness to common attacks, such as Gaussian blurring, JPEG compression, and Resizing. We also analyze the factors that influence the attribution performance, and explore the boost brought by the proposed method as a plug-in to improve the performance of existing SOTA. We hope our work can shed some light on the solutions to addressing the source of AI-generated images, as well as to prevent the misuse of text-to-image generative models.
- [566] arXiv:2403.01510 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: End-to-End Human Instance MattingJournal-ref: IEEE T-CSVT 2023Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Human instance matting aims to estimate an alpha matte for each human instance in an image, which is extremely challenging and has rarely been studied so far. Despite some efforts to use instance segmentation to generate a trimap for each instance and apply trimap-based matting methods, the resulting alpha mattes are often inaccurate due to inaccurate segmentation. In addition, this approach is computationally inefficient due to multiple executions of the matting method. To address these problems, this paper proposes a novel End-to-End Human Instance Matting (E2E-HIM) framework for simultaneous multiple instance matting in a more efficient manner. Specifically, a general perception network first extracts image features and decodes instance contexts into latent codes. Then, a united guidance network exploits spatial attention and semantics embedding to generate united semantics guidance, which encodes the locations and semantic correspondences of all instances. Finally, an instance matting network decodes the image features and united semantics guidance to predict all instance-level alpha mattes. In addition, we construct a large-scale human instance matting dataset (HIM-100K) comprising over 100,000 human images with instance alpha matte labels. Experiments on HIM-100K demonstrate the proposed E2E-HIM outperforms the existing methods on human instance matting with 50% lower errors and 5X faster speed (6 instances in a 640X640 image). Experiments on the PPM-100, RWP-636, and P3M datasets demonstrate that E2E-HIM also achieves competitive performance on traditional human matting.
- [567] arXiv:2403.01528 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A SurveyComments: Survey Paper. 25 pages, 9 figures, and 3 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Abstract: The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in \url{ this https URL }.
- [568] arXiv:2403.01533 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Machine learning predicts long-term mortality after acute myocardial infarction using systolic time intervals and routinely collected clinical dataComments: Accepted for publication in "Intelligent Medicine"Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Precise estimation of cardiac patients' current and future comorbidities is an important factor in prioritizing continuous physiological monitoring and new therapies. ML models have shown satisfactory performance in short-term mortality prediction of patients with heart disease, while their utility in long-term predictions is limited. This study aims to investigate the performance of tree-based ML models on long-term mortality prediction and the effect of two recently introduced biomarkers on long-term mortality. This study utilized publicly available data from CCHIA at the Ministry of Health and Welfare, Taiwan, China. Medical records were used to gather demographic and clinical data, including age, gender, BMI, percutaneous coronary intervention (PCI) status, and comorbidities such as hypertension, dyslipidemia, ST-segment elevation myocardial infarction (STEMI), and non-STEMI. Using medical and demographic records as well as two recently introduced biomarkers, brachial pre-ejection period (bPEP) and brachial ejection time (bET), collected from 139 patients with acute myocardial infarction, we investigated the performance of advanced ensemble tree-based ML algorithms (random forest, AdaBoost, and XGBoost) to predict all-cause mortality within 14 years. The developed ML models achieved significantly better performance compared to the baseline LR (C-Statistic, 0.80 for random forest, 0.79 for AdaBoost, and 0.78 for XGBoost, vs 0.77 for LR) (P-RF<0.001, PAdaBoost<0.001, PXGBoost<0.05). Adding bPEP and bET to our feature set significantly improved the algorithms' performance, leading to an absolute increase in C-Statistic of up to 0.03 (C-Statistic, 0.83 for random forest, 0.82 for AdaBoost, and 0.80 for XGBoost, vs 0.74 for LR) (P-RF<0.001, PAdaBoost<0.001, PXGBoost<0.05). This advancement may enable better treatment prioritization for high-risk individuals.
- [569] arXiv:2403.01548 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination MitigationComments: code repo is available at: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) frequently hallucinate and produce factual errors, yet our understanding of why they make these errors remains limited. In this study, we delve into the underlying mechanisms of LLM hallucinations from the perspective of inner representations, and discover a salient pattern associated with hallucinations: correct generations tend to have sharper context activations in the hidden states of the in-context tokens, compared to the incorrect ones. Leveraging this insight, we propose an entropy-based metric to quantify the ``sharpness'' among the in-context hidden states and incorporate it into the decoding process to formulate a constrained decoding approach. Experiments on various knowledge-seeking and hallucination benchmarks demonstrate our approach's consistent effectiveness, for example, achieving up to an 8.6 point improvement on TruthfulQA. We believe this study can improve our understanding of hallucinations and serve as a practical solution for hallucination mitigation.
- [570] arXiv:2403.01564 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: ComTraQ-MPC: Meta-Trained DQN-MPC Integration for Trajectory Tracking with Limited Active Localization UpdatesComments: * Equal contributionSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Optimal decision-making for trajectory tracking in partially observable, stochastic environments where the number of active localization updates -- the process by which the agent obtains its true state information from the sensors -- are limited, presents a significant challenge. Traditional methods often struggle to balance resource conservation, accurate state estimation and precise tracking, resulting in suboptimal performance. This problem is particularly pronounced in environments with large action spaces, where the need for frequent, accurate state data is paramount, yet the capacity for active localization updates is restricted by external limitations. This paper introduces ComTraQ-MPC, a novel framework that combines Deep Q-Networks (DQN) and Model Predictive Control (MPC) to optimize trajectory tracking with constrained active localization updates. The meta-trained DQN ensures adaptive active localization scheduling, while the MPC leverages available state information to improve tracking. The central contribution of this work is their reciprocal interaction: DQN's update decisions inform MPC's control strategy, and MPC's outcomes refine DQN's learning, creating a cohesive, adaptive system. Empirical evaluations in simulated and real-world settings demonstrate that ComTraQ-MPC significantly enhances operational efficiency and accuracy, providing a generalizable and approximately optimal solution for trajectory tracking in complex partially observable environments.
- [571] arXiv:2403.01567 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: ReMatch: Retrieval Enhanced Schema Matching with LLMsSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI)
Abstract: Schema matching is a crucial task in data integration, involving the alignment of a source database schema with a target schema to establish correspondence between their elements. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. Although machine-learning-based solutions have been explored in numerous studies, they often suffer from low accuracy, require manual mapping of the schemas for model training, or need access to source schema data which might be unavailable due to privacy concerns. In this paper we present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs). Our method avoids the need for predefined mapping, any model training, or access to data in the source database. In the ReMatch method the tables of the target schema and the attributes of the source schema are first represented as structured passage-based documents. For each source attribute document, we retrieve $J$ documents, representing target schema tables, according to their semantic relevance. Subsequently, we create a prompt for every source table, comprising all its attributes and their descriptions, alongside all attributes from the set of top $J$ target tables retrieved previously. We employ LLMs using this prompt for the matching task, yielding a ranked list of $K$ potential matches for each source attribute. Our experimental results on large real-world schemas demonstrate that ReMatch significantly improves matching capabilities and outperforms other machine learning approaches. By eliminating the requirement for training data, ReMatch becomes a viable solution for real-world scenarios.
- [572] arXiv:2403.01569 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Kick Back & Relax++: Scaling Beyond Ground-Truth Depth with SlowTV & CribsTVSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Self-supervised learning is the key to unlocking generic computer vision systems. By eliminating the reliance on ground-truth annotations, it allows scaling to much larger data quantities. Unfortunately, self-supervised monocular depth estimation (SS-MDE) has been limited by the absence of diverse training data. Existing datasets have focused exclusively on urban driving in densely populated cities, resulting in models that fail to generalize beyond this domain.
To address these limitations, this paper proposes two novel datasets: SlowTV and CribsTV. These are large-scale datasets curated from publicly available YouTube videos, containing a total of 2M training frames. They offer an incredibly diverse set of environments, ranging from snowy forests to coastal roads, luxury mansions and even underwater coral reefs. We leverage these datasets to tackle the challenging task of zero-shot generalization, outperforming every existing SS-MDE approach and even some state-of-the-art supervised methods.
The generalization capabilities of our models are further enhanced by a range of components and contributions: 1) learning the camera intrinsics, 2) a stronger augmentation regime targeting aspect ratio changes, 3) support frame randomization, 4) flexible motion estimation, 5) a modern transformer-based architecture. We demonstrate the effectiveness of each component in extensive ablation experiments. To facilitate the development of future research, we make the datasets, code and pretrained models available to the public at this https URL . - [573] arXiv:2403.01575 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: SARD: A Human-AI Collaborative Story GenerationSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Generative artificial intelligence (GenAI) has ushered in a new era for storytellers, providing a powerful tool to ignite creativity and explore uncharted narrative territories. As technology continues to advance, the synergy between human creativity and AI-generated content holds the potential to redefine the landscape of storytelling. In this work, we propose SARD, a drag-and-drop visual interface for generating a multi-chapter story using large language models. Our evaluation of the usability of SARD and its creativity support shows that while node-based visualization of the narrative may help writers build a mental model, it exerts unnecessary mental overhead to the writer and becomes a source of distraction as the story becomes more elaborated. We also found that AI generates stories that are less lexically diverse, irrespective of the complexity of the story. We identified some patterns and limitations of our tool that can guide the development of future human-AI co-writing tools.
- [574] arXiv:2403.01580 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI ArchitecturesComments: PhD thesisSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In the current machine translation (MT) landscape, the Transformer architecture stands out as the gold standard, especially for high-resource language pairs. This research delves into its efficacy for low-resource language pairs including both the English$\leftrightarrow$Irish and English$\leftrightarrow$Marathi language pairs. Notably, the study identifies the optimal hyperparameters and subword model type to significantly improve the translation quality of Transformer models for low-resource language pairs.
The scarcity of parallel datasets for low-resource languages can hinder MT development. To address this, gaHealth was developed, the first bilingual corpus of health data for the Irish language. Focusing on the health domain, models developed using this in-domain dataset exhibited very significant improvements in BLEU score when compared with models from the LoResMT2021 Shared Task. A subsequent human evaluation using the multidimensional quality metrics error taxonomy showcased the superior performance of the Transformer system in reducing both accuracy and fluency errors compared to an RNN-based counterpart.
Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source applications streamlined for the development, fine-tuning, and deployment of neural machine translation models. These tools considerably simplify the setup and evaluation process, making MT more accessible to both developers and translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes eco-friendly natural language processing research by highlighting the environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM demonstrated advancements in translation performance for two low-resource language pairs: English$\leftrightarrow$Irish and English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021 Shared Task. - [575] arXiv:2403.01598 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: APISR: Anime Production Inspired Real-World Anime Super-ResolutionSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: While real-world anime super-resolution (SR) has gained increasing attention in the SR community, existing methods still adopt techniques from the photorealistic domain. In this paper, we analyze the anime production workflow and rethink how to use characteristics of it for the sake of the real-world anime SR. First, we argue that video networks and datasets are not necessary for anime SR due to the repetition use of hand-drawing frames. Instead, we propose an anime image collection pipeline by choosing the least compressed and the most informative frames from the video sources. Based on this pipeline, we introduce the Anime Production-oriented Image (API) dataset. In addition, we identify two anime-specific challenges of distorted and faint hand-drawn lines and unwanted color artifacts. We address the first issue by introducing a prediction-oriented compression module in the image degradation model and a pseudo-ground truth preparation with enhanced hand-drawn lines. In addition, we introduce the balanced twin perceptual loss combining both anime and photorealistic high-level features to mitigate unwanted color artifacts and increase visual clarity. We evaluate our method through extensive experiments on the public benchmark, showing our method outperforms state-of-the-art anime dataset-trained approaches.
- [576] arXiv:2403.01599 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional VideosComments: Accepted by ICLR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, which overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leveraged the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.
- [577] arXiv:2403.01600 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Can Poverty Be Reduced by Acting on Discrimination? An Agent-based Model for Policy MakingSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI)
Abstract: In the last decades, there has been a deceleration in the rates of poverty reduction, suggesting that traditional redistributive approaches to poverty mitigation could be losing effectiveness, and alternative insights to advance the number one UN Sustainable Development Goal are required. The criminalization of poor people has been denounced by several NGOs, and an increasing number of voices suggest that discrimination against the poor (a phenomenon known as \emph{aporophobia}) could be an impediment to mitigating poverty. In this paper, we present the novel Aporophobia Agent-Based Model (AABM) to provide evidence of the correlation between aporophobia and poverty computationally. We present our use case built with real-world demographic data and poverty-mitigation public policies (either enforced or under parliamentary discussion) for the city of Barcelona. We classify policies as discriminatory or non-discriminatory against the poor, with the support of specialized NGOs, and we observe the results in the AABM in terms of the impact on wealth inequality. The simulation provides evidence of the relationship between aporophobia and the increase of wealth inequality levels, paving the way for a new generation of poverty reduction policies that act on discrimination and tackle poverty as a societal problem (not only a problem of the poor).
- [578] arXiv:2403.01605 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Provable Log Density Policy GradientSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Modern policy gradient methods, although successful, introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and correcting for it could potentially improve sample-complexity of reinforcement learning methods. To that end, we propose log density gradient to estimate the policy gradient, which corrects for this residual error term. Log density gradient method computes policy gradient by utilising the state-action discounted distributional formulation. We first present the equations needed to exactly find the log density gradient for a tabular Markov Decision Processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates log density gradient by utilizing backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive we also propose a min-max optimization that can approximate log density gradient using just on-policy samples. We also prove uniqueness, and convergence under linear function approximation, for this min-max optimization. Finally, we show that the sample complexity of our min-max optimization to be of the order of $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof-of-concept for our log density gradient method on gridworld environment, and observe that our method is able to improve upon the classical policy gradient method by a clear margin, thus indicating a promising novel direction to develop reinforcement learning algorithms that require fewer samples.
- [579] arXiv:2403.01606 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Unified Model Selection Technique for Spectral Clustering Based Motion SegmentationComments: for the published version, see this https URLJournal-ref: Journal of Computational Vision and Imaging Systems 9 (2023) 68-71Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Motion segmentation is a fundamental problem in computer vision and is crucial in various applications such as robotics, autonomous driving and action recognition. Recently, spectral clustering based methods have shown impressive results on motion segmentation in dynamic environments. These methods perform spectral clustering on motion affinity matrices to cluster objects or point trajectories in the scene into different motion groups. However, existing methods often need the number of motions present in the scene to be known, which significantly reduces their practicality. In this paper, we propose a unified model selection technique to automatically infer the number of motion groups for spectral clustering based motion segmentation methods by combining different existing model selection techniques together. We evaluate our method on the KT3DMoSeg dataset and achieve competitve results comparing to the baseline where the number of clusters is given as ground truth information.
- [580] arXiv:2403.01621 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Machine Learning vs Deep Learning: The Generalization ProblemComments: 10 pages, 2 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The capacity to generalize beyond the range of training data is a pivotal challenge, often synonymous with a model's utility and robustness. This study investigates the comparative abilities of traditional machine learning (ML) models and deep learning (DL) algorithms in terms of extrapolation -- a more challenging aspect of generalization because it requires the model to make inferences about data points that lie outside the domain it has been trained on. We present an empirical analysis where both ML and DL models are trained on an exponentially growing function and then tested on values outside the training domain. The choice of this function allows us to distinctly showcase the divergence in performance when models are required to predict beyond the scope of their training data. Our findings suggest that deep learning models possess inherent capabilities to generalize beyond the training scope, an essential feature for real-world applications where data is often incomplete or extends beyond the observed range. This paper argues for a nuanced understanding of the structural differences between ML and DL models, with an emphasis on the implications for both theoretical research and practical deployment.
- [581] arXiv:2403.01643 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: You Need to Pay Better AttentionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We introduce three new attention mechanisms that outperform standard multi-head attention in terms of efficiency and learning capabilities, thereby improving the performance and broader deployability of Transformer models. Our first contribution is Optimised Attention, which performs similarly to standard attention, but has 3/4 as many parameters and one matrix multiplication fewer per head. Next, we introduce Efficient Attention, which performs on par with standard attention with only 1/2 as many parameters as many parameters and two matrix multiplications fewer per head and is up to twice as fast as standard attention. Lastly, we introduce Super Attention, which surpasses standard attention by a significant margin in both vision and natural language processing tasks while having fewer parameters and matrix multiplications. In addition to providing rigorous mathematical comparisons, we evaluate the presented attention mechanisms on MNIST, CIFAR100, IMDB Movie Reviews, and Amazon Reviews datasets.
- [582] arXiv:2403.01649 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Recommendations for Government Development and Use of Advanced Automated Systems to Make Decisions about IndividualsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Contestability -- the ability to effectively challenge a decision -- is critical to the implementation of fairness. In the context of governmental decision making about individuals, contestability is often constitutionally required as an element of due process; specific procedures may be required by state or federal law relevant to a particular program. In addition, contestability can be a valuable way to discover systemic errors, contributing to ongoing assessments and system improvement.
On January 24-25, 2024, with support from the National Science Foundation and the William and Flora Hewlett Foundation, we convened a diverse group of government officials, representatives of leading technology companies, technology and policy experts from academia and the non-profit sector, advocates, and stakeholders for a workshop on advanced automated decision making, contestability, and the law. Informed by the workshop's rich and wide-ranging discussion, we offer these recommendations. A full report summarizing the discussion is in preparation. - [583] arXiv:2403.01673 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous VariablesSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the difficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as core predictor, CATS achieves state-of-the-art, significantly reducing complexity and parameters compared to previous multivariate models, marking it an efficient and transferable MTSF solution.
- [584] arXiv:2403.01693 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HanDiffuser: Text-to-Image Generation With Realistic Hand AppearancesSupreeth Narasimhaswamy , Uttaran Bhattacharya , Xiang Chen , Ishita Dasgupta , Saayan Mitra , Minh HoaiComments: Revisions: 1. Added a link to project page in the abstract, 2. Updated references and related work, 3. Fixed some grammatical errorsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
- [585] arXiv:2403.01695 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DyCE: Dynamic Configurable Exiting for Deep Learning Compression and ScalingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Modern deep learning (DL) models necessitate the employment of scaling and compression techniques for effective deployment in resource-constrained environments. Most existing techniques, such as pruning and quantization are generally static. On the other hand, dynamic compression methods, such as early exits, reduce complexity by recognizing the difficulty of input samples and allocating computation as needed. Dynamic methods, despite their superior flexibility and potential for co-existing with static methods, pose significant challenges in terms of implementation due to any changes in dynamic parts will influence subsequent processes. Moreover, most current dynamic compression designs are monolithic and tightly integrated with base models, thereby complicating the adaptation to novel base models. This paper introduces DyCE, an dynamic configurable early-exit framework that decouples design considerations from each other and from the base model. Utilizing this framework, various types and positions of exits can be organized according to predefined configurations, which can be dynamically switched in real-time to accommodate evolving performance-complexity requirements. We also propose techniques for generating optimized configurations based on any desired trade-off between performance and computational complexity. This empowers future researchers to focus on the improvement of individual exits without latent compromise of overall system performance. The efficacy of this approach is demonstrated through image classification tasks with deep CNNs. DyCE significantly reduces the computational complexity by 23.5% of ResNet152 and 25.9% of ConvNextv2-tiny on ImageNet, with accuracy reductions of less than 0.5%. Furthermore, DyCE offers advantages over existing dynamic methods in terms of real-time configuration and fine-grained performance tuning.
- [586] arXiv:2403.01698 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Hypertext Entity Extraction in WebpageSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Webpage entity extraction is a fundamental natural language processing task in both research and applications. Nowadays, the majority of webpage entity extraction models are trained on structured datasets which strive to retain textual content and its structure information. However, existing datasets all overlook the rich hypertext features (e.g., font color, font size) which show their effectiveness in previous works. To this end, we first collect a \textbf{H}ypertext \textbf{E}ntity \textbf{E}xtraction \textbf{D}ataset (\textit{HEED}) from the e-commerce domains, scraping both the text and the corresponding explicit hypertext features with high-quality manual entity annotations. Furthermore, we present the \textbf{Mo}E-based \textbf{E}ntity \textbf{E}xtraction \textbf{F}ramework (\textit{MoEEF}), which efficiently integrates multiple features to enhance model performance by Mixture of Experts and outperforms strong baselines, including the state-of-the-art small-scale models and GPT-3.5-turbo. Moreover, the effectiveness of hypertext features in \textit{HEED} and several model components in \textit{MoEEF} are analyzed.
- [587] arXiv:2403.01699 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Brilla AI: AI Contestant for the National Science and Maths QuizGeorge Boateng , Jonathan Abrefah Mensah , Kevin Takyi Yeboah , William Edor , Andrew Kojo Mensah-Onumah , Naafi Dasana Ibrahim , Nana Sam YeboahComments: 14 pages. Accepted for the WideAIED track at the 25th International Conference on AI in Education (AIED 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: The African continent lacks enough qualified teachers which hampers the provision of adequate learning support. An AI could potentially augment the efforts of the limited number of teachers, leading to better learning outcomes. Towards that end, this work describes and evaluates the first key output for the NSMQ AI Grand Challenge, which proposes a robust, real-world benchmark for such an AI: "Build an AI to compete live in Ghana's National Science and Maths Quiz (NSMQ) competition and win - performing better than the best contestants in all rounds and stages of the competition". The NSMQ is an annual live science and mathematics competition for senior secondary school students in Ghana in which 3 teams of 2 students compete by answering questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages until a winning team is crowned for that year. In this work, we built Brilla AI, an AI contestant that we deployed to unofficially compete remotely and live in the Riddles round of the 2023 NSMQ Grand Finale, the first of its kind in the 30-year history of the competition. Brilla AI is currently available as a web app that livestreams the Riddles round of the contest, and runs 4 machine learning systems: (1) speech to text (2) question extraction (3) question answering and (4) text to speech that work together in real-time to quickly and accurately provide an answer, and then say it with a Ghanaian accent. In its debut, our AI answered one of the 4 riddles ahead of the 3 human contesting teams, unofficially placing second (tied). Improvements and extensions of this AI could potentially be deployed to offer science tutoring to students and eventually enable millions across Africa to have one-on-one learning interactions, democratizing science education.
- [588] arXiv:2403.01709 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Can LLMs Generate Architectural Design Decisions? -An Exploratory Empirical studyComments: This paper has been accepted to IEEE ICSA 2024 (Main Track - Research Track)Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Architectural Knowledge Management (AKM) involves the organized handling of information related to architectural decisions and design within a project or organization. An essential artifact of AKM is the Architecture Decision Records (ADR), which documents key design decisions. ADRs are documents that capture decision context, decision made and various aspects related to a design decision, thereby promoting transparency, collaboration, and understanding. Despite their benefits, ADR adoption in software development has been slow due to challenges like time constraints and inconsistent uptake. Recent advancements in Large Language Models (LLMs) may help bridge this adoption gap by facilitating ADR generation. However, the effectiveness of LLM for ADR generation or understanding is something that has not been explored. To this end, in this work, we perform an exploratory study that aims to investigate the feasibility of using LLM for the generation of ADRs given the decision context. In our exploratory study, we utilize GPT and T5-based models with 0-shot, few-shot, and fine-tuning approaches to generate the Decision of an ADR given its Context. Our results indicate that in a 0-shot setting, state-of-the-art models such as GPT-4 generate relevant and accurate Design Decisions, although they fall short of human-level performance. Additionally, we observe that more cost-effective models like GPT-3.5 can achieve similar outcomes in a few-shot setting, and smaller models such as Flan-T5 can yield comparable results after fine-tuning. To conclude, this exploratory study suggests that LLM can generate Design Decisions, but further research is required to attain human-level generation and establish standardized widespread adoption.
- [589] arXiv:2403.01734 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery PolicyComments: Accepted by ICRA24Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Offline goal-conditioned reinforcement learning (GCRL) aims at solving goal-reaching tasks with sparse rewards from an offline dataset. While prior work has demonstrated various approaches for agents to learn near-optimal policies, these methods encounter limitations when dealing with diverse constraints in complex environments, such as safety constraints. Some of these approaches prioritize goal attainment without considering safety, while others excessively focus on safety at the expense of training efficiency. In this paper, we study the problem of constrained offline GCRL and propose a new method called Recovery-based Supervised Learning (RbSL) to accomplish safety-critical tasks with various goals. To evaluate the method performance, we build a benchmark based on the robot-fetching environment with a randomly positioned obstacle and use expert or random policies to generate an offline dataset. We compare RbSL with three offline GCRL algorithms and one offline safe RL algorithm. As a result, our method outperforms the existing state-of-the-art methods to a large extent. Furthermore, we validate the practicality and effectiveness of RbSL by deploying it on a real Panda manipulator. Code is available at this https URL .
- [590] arXiv:2403.01742 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Diffusion-TS: Interpretable Diffusion for General Time Series GenerationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. It has recently shown breakthroughs in audio synthesis, time series imputation and forecasting. In this paper, we propose Diffusion-TS, a novel diffusion-based framework that generates multivariate time series samples of high quality by using an encoder-decoder transformer with disentangled temporal representations, in which the decomposition technique guides Diffusion-TS to capture the semantic meaning of time series while transformers mine detailed sequential information from the noisy model input. Different from existing diffusion-based approaches, we train the model to directly reconstruct the sample instead of the noise in each diffusion step, combining a Fourier-based loss term. Diffusion-TS is expected to generate time series satisfying both interpretablity and realness. In addition, it is shown that the proposed Diffusion-TS can be easily extended to conditional generation tasks, such as forecasting and imputation, without any model changes. This also motivates us to further explore the performance of Diffusion-TS under irregular settings. Finally, through qualitative and quantitative experiments, results show that Diffusion-TS achieves the state-of-the-art results on various realistic analyses of time series.
- [591] arXiv:2403.01748 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Decode Neural signal as SpeechSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Decoding language from brain dynamics is an important open direction in the realm of brain-computer interface (BCI), especially considering the rapid growth of large language models. Compared to invasive-based signals which require electrode implantation surgery, non-invasive neural signals (e.g. EEG, MEG) have attracted increasing attention considering their safety and generality. However, the exploration is not adequate in three aspects: 1) previous methods mainly focus on EEG but none of the previous works address this problem on MEG with better signal quality; 2) prior works have predominantly used ``teacher-forcing" during generative decoding, which is impractical; 3) prior works are mostly ``BART-based" not fully auto-regressive, which performs better in other sequence tasks. In this paper, we explore the brain-to-text translation of MEG signals in a speech-decoding formation. Here we are the first to investigate a cross-attention-based ``whisper" model for generating text directly from MEG signals without teacher forcing. Our model achieves impressive BLEU-1 scores of 60.30 and 52.89 without pretraining \& teacher-forcing on two major datasets (\textit{GWilliams} and \textit{Schoffelen}). This paper conducts a comprehensive review to understand how speech decoding formation performs on the neural decoding tasks, including pretraining initialization, training \& evaluation set splitting, augmentation, and scaling law.
- [592] arXiv:2403.01768 (cross-list from cs.SY) [ pdf , ps , html , other ]
-
Title: Canonical Form of Datatic Description in Control SystemsSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI)
Abstract: The design of feedback controllers is undergoing a paradigm shift from modelic (i.e., model-driven) control to datatic (i.e., data-driven) control. Canonical form of state space model is an important concept in modelic control systems, exemplified by Jordan form, controllable form and observable form, whose purpose is to facilitate system analysis and controller synthesis. In the realm of datatic control, there is a notable absence in the standardization of data-based system representation. This paper for the first time introduces the concept of canonical data form for the purpose of achieving more effective design of datatic controllers. In a control system, the data sample in canonical form consists of a transition component and an attribute component. The former encapsulates the plant dynamics at the sampling time independently, which is a tuple containing three elements: a state, an action and their corresponding next state. The latter describes one or some artificial characteristics of the current sample, whose calculation must be performed in an online manner. The attribute of each sample must adhere to two requirements: (1) causality, ensuring independence from any future samples; and (2) locality, allowing dependence on historical samples but constrained to a finite neighboring set. The purpose of adding attribute is to offer some kinds of benefits for controller design in terms of effectiveness and efficiency. To provide a more close-up illustration, we present two canonical data forms: temporal form and spatial form, and demonstrate their advantages in reducing instability and enhancing training efficiency in two datatic control systems.
- [593] arXiv:2403.01769 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Safe Screening Rule with Bi-level Optimization of $\nu$ Support Vector MachineSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Support vector machine (SVM) has achieved many successes in machine learning, especially for a small sample problem. As a famous extension of the traditional SVM, the $\nu$ support vector machine ($\nu$-SVM) has shown outstanding performance due to its great model interpretability. However, it still faces challenges in training overhead for large-scale problems. To address this issue, we propose a safe screening rule with bi-level optimization for $\nu$-SVM (SRBO-$\nu$-SVM) which can screen out inactive samples before training and reduce the computational cost without sacrificing the prediction accuracy. Our SRBO-$\nu$-SVM is strictly deduced by integrating the Karush-Kuhn-Tucker (KKT) conditions, the variational inequalities of convex problems and the $\nu$-property. Furthermore, we develop an efficient dual coordinate descent method (DCDM) to further improve computational speed. Finally, a unified framework for SRBO is proposed to accelerate many SVM-type models, and it is successfully applied to one-class SVM. Experimental results on 6 artificial data sets and 30 benchmark data sets have verified the effectiveness and safety of our proposed methods in supervised and unsupervised tasks.
- [594] arXiv:2403.01773 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Improving out-of-distribution generalization in graphs via hierarchical semantic environmentsComments: Accepted by CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Out-of-distribution (OOD) generalization in the graph domain is challenging due to complex distribution shifts and a lack of environmental contexts. Recent methods attempt to enhance graph OOD generalization by generating flat environments. However, such flat environments come with inherent limitations to capture more complex data distributions. Considering the DrugOOD dataset, which contains diverse training environments (e.g., scaffold, size, etc.), flat contexts cannot sufficiently address its high heterogeneity. Thus, a new challenge is posed to generate more semantically enriched environments to enhance graph invariant learning for handling distribution shifts. In this paper, we propose a novel approach to generate hierarchical semantic environments for each graph. Firstly, given an input graph, we explicitly extract variant subgraphs from the input graph to generate proxy predictions on local environments. Then, stochastic attention mechanisms are employed to re-extract the subgraphs for regenerating global environments in a hierarchical manner. In addition, we introduce a new learning objective that guides our model to learn the diversity of environments within the same hierarchy while maintaining consistency across different hierarchies. This approach enables our model to consider the relationships between environments and facilitates robust graph invariant learning. Extensive experiments on real-world graph data have demonstrated the effectiveness of our framework. Particularly, in the challenging dataset DrugOOD, our method achieves up to 1.29\% and 2.83\% improvement over the best baselines on IC50 and EC50 prediction tasks, respectively.
- [595] arXiv:2403.01781 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence LearningComments: accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of computer vision and graphics, accurately establishing correspondences between geometric 3D shapes is pivotal for applications like object tracking, registration, texture transfer, and statistical shape analysis. Moving beyond traditional hand-crafted and data-driven feature learning methods, we incorporate spectral methods with deep learning, focusing on functional maps (FMs) and optimal transport (OT). Traditional OT-based approaches, often reliant on entropy regularization OT in learning-based framework, face computational challenges due to their quadratic cost. Our key contribution is to employ the sliced Wasserstein distance (SWD) for OT, which is a valid fast optimal transport metric in an unsupervised shape matching framework. This unsupervised framework integrates functional map regularizers with a novel OT-based loss derived from SWD, enhancing feature alignment between shapes treated as discrete probability measures. We also introduce an adaptive refinement process utilizing entropy regularized OT, further refining feature alignments for accurate point-to-point correspondences. Our method demonstrates superior performance in non-rigid shape matching, including near-isometric and non-isometric scenarios, and excels in downstream tasks like segmentation transfer. The empirical results on diverse datasets highlight our framework's effectiveness and generalization capabilities, setting new standards in non-rigid shape matching with efficient OT metrics and an adaptive refinement module.
- [596] arXiv:2403.01791 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Beyond Recommender: An Exploratory Study of the Effects of Different AI Roles in AI-Assisted Decision MakingSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Artificial Intelligence (AI) is increasingly employed in various decision-making tasks, typically as a Recommender, providing recommendations that the AI deems correct. However, recent studies suggest this may diminish human analytical thinking and lead to humans' inappropriate reliance on AI, impairing the synergy in human-AI teams. In contrast, human advisors in group decision-making perform various roles, such as analyzing alternative options or criticizing decision-makers to encourage their critical thinking. This diversity of roles has not yet been empirically explored in AI assistance. In this paper, we examine three AI roles: Recommender, Analyzer, and Devil's Advocate, and evaluate their effects across two AI performance levels. Our results show each role's distinct strengths and limitations in task performance, reliance appropriateness, and user experience. Notably, the Recommender role is not always the most effective, especially if the AI performance level is low, the Analyzer role may be preferable. These insights offer valuable implications for designing AI assistants with adaptive functional roles according to different situations.
- [597] arXiv:2403.01801 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: COLA: Cross-city Mobility Transformer for Human Trajectory SimulationComments: Accepted by WWW 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Human trajectory data produced by daily mobile devices has proven its usefulness in various substantial fields such as urban planning and epidemic prevention. In terms of the individual privacy concern, human trajectory simulation has attracted increasing attention from researchers, targeting at offering numerous realistic mobility data for downstream tasks. Nevertheless, the prevalent issue of data scarcity undoubtedly degrades the reliability of existing deep learning models. In this paper, we are motivated to explore the intriguing problem of mobility transfer across cities, grasping the universal patterns of human trajectories to augment the powerful Transformer with external mobility data. There are two crucial challenges arising in the knowledge transfer across cities: 1) how to transfer the Transformer to adapt for domain heterogeneity; 2) how to calibrate the Transformer to adapt for subtly different long-tail frequency distributions of locations. To address these challenges, we have tailored a Cross-city mObiLity trAnsformer (COLA) with a dedicated model-agnostic transfer framework by effectively transferring cross-city knowledge for human trajectory simulation. Firstly, COLA divides the Transformer into the private modules for city-specific characteristics and the shared modules for city-universal mobility patterns. Secondly, COLA leverages a lightweight yet effective post-hoc adjustment strategy for trajectory simulation, without disturbing the complex bi-level optimization of model-agnostic knowledge transfer. Extensive experiments of COLA compared to state-of-the-art single-city baselines and our implemented cross-city baselines have demonstrated its superiority and effectiveness. The code is available at this https URL .
- [598] arXiv:2403.01818 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic SegmentationComments: Accepted by CVPR 2024; correct typos; this is not the camera-ready versionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate the burden of time-consuming pixel-level manual labeling, which leverages limited labeled data along with larger amounts of unlabeled data. Current state-of-the-art methods train the labeled data with ground truths and unlabeled data with pseudo labels. However, the two training flows are separate, which allows labeled data to dominate the training process, resulting in low-quality pseudo labels and, consequently, sub-optimal results. To alleviate this issue, we present AllSpark, which reborns the labeled features from unlabeled ones with the channel-wise cross-attention mechanism. We further introduce a Semantic Memory along with a Channel Semantic Grouping strategy to ensure that unlabeled features adequately represent labeled features. The AllSpark shed new light on the architecture level designs of SSSS rather than framework level, which avoids increasingly complicated training pipeline designs. It can also be regarded as a flexible bottleneck module that can be seamlessly integrated into a general transformer-based segmentation model. The proposed AllSpark outperforms existing methods across all evaluation protocols on Pascal, Cityscapes and COCO benchmarks without bells-and-whistles. Code and model weights are available at: this https URL .
- [599] arXiv:2403.01823 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: RT-H: Action Hierarchies Using LanguageSuneel Belkhale , Tianli Ding , Ted Xiao , Pierre Sermanet , Quon Vuong , Jonathan Tompson , Yevgen Chebotar , Debidatta Dwibedi , Dorsa SadighSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are found at this https URL .
- [600] arXiv:2403.01827 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Analysis and Fully Memristor-based Reservoir Computing for Temporal Data ClassificationComments: 22 pages, 20 figures, Journal, Typo corrected and updated referenceSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Reservoir computing (RC) offers a neuromorphic framework that is particularly effective for processing spatiotemporal signals. Known for its temporal processing prowess, RC significantly lowers training costs compared to conventional recurrent neural networks. A key component in its hardware deployment is the ability to generate dynamic reservoir states. Our research introduces a novel dual-memory RC system, integrating a short-term memory via a WOx-based memristor, capable of achieving 16 distinct states encoded over 4 bits, and a long-term memory component using a TiOx-based memristor within the readout layer. We thoroughly examine both memristor types and leverage the RC system to process temporal data sets. The performance of the proposed RC system is validated through two benchmark tasks: isolated spoken digit recognition with incomplete inputs and Mackey-Glass time series prediction. The system delivered an impressive 98.84% accuracy in digit recognition and sustained a low normalized root mean square error (NRMSE) of 0.036 in the time series prediction task, underscoring its capability. This study illuminates the adeptness of memristor-based RC systems in managing intricate temporal challenges, laying the groundwork for further innovations in neuromorphic computing.
- [601] arXiv:2403.01840 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FreeA: Human-object Interaction Detection using Free Annotation LabelsComments: 11 pages, 7 figures, 6 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent human-object interaction (HOI) detection approaches rely on high cost of manpower and require comprehensive annotated image datasets. In this paper, we propose a novel self-adaption language-driven HOI detection method, termed as FreeA, without labeling by leveraging the adaptability of CLIP to generate latent HOI labels. To be specific, FreeA matches image features of human-object pairs with HOI text templates, and a priori knowledge-based mask method is developed to suppress improbable interactions. In addition, FreeA utilizes the proposed interaction correlation matching method to enhance the likelihood of actions related to a specified action, further refine the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI models. Our approach is +8.58 mean Average Precision (mAP) on HICO-DET and +1.23 mAP on V-COCO more accurate in localizing and classifying the interactive actions than the newest weakly model, and +1.68 mAP and +7.28 mAP than the latest weakly+ model, respectively. Code will be available at this https URL .
- [602] arXiv:2403.01845 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: NASH: Neural Architecture Search for Hardware-Optimized Machine Learning ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: As machine learning (ML) algorithms get deployed in an ever-increasing number of applications, these algorithms need to achieve better trade-offs between high accuracy, high throughput and low latency. This paper introduces NASH, a novel approach that applies neural architecture search to machine learning hardware. Using NASH, hardware designs can achieve not only high throughput and low latency but also superior accuracy performance. We present four versions of the NASH strategy in this paper, all of which show higher accuracy than the original models. The strategy can be applied to various convolutional neural networks, selecting specific model operations among many to guide the training process toward higher accuracy. Experimental results show that applying NASH on ResNet18 or ResNet34 achieves a top 1 accuracy increase of up to 3.1% and a top 5 accuracy increase of up to 2.2% compared to the non-NASH version when tested on the ImageNet data set. We also integrated this approach into the FINN hardware model synthesis tool to automate the application of our approach and the generation of the hardware model. Results show that using FINN can achieve a maximum throughput of 324.5 fps. In addition, NASH models can also result in a better trade-off between accuracy and hardware resource utilization. The accuracy-hardware (HW) Pareto curve shows that the models with the four NASH versions represent the best trade-offs achieving the highest accuracy for a given HW utilization. The code for our implementation is open-source and publicly available on GitHub at this https URL .
- [603] arXiv:2403.01849 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language ModelsComments: CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of the text prompt instead of the extensively studied model weights (frozen in this work). We first show that the effectiveness of both adversarial attack and defense are sensitive to the used text prompt. Inspired by this, we propose a method to improve resilience to adversarial attacks by learning a robust text prompt for VLMs. The proposed method, named Adversarial Prompt Tuning (APT), is effective while being both computationally and data efficient. Extensive experiments are conducted across 15 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show APT's superiority over hand-engineered prompts and other state-of-the-art adaption methods. APT demonstrated excellent abilities in terms of the in-distribution performance and the generalization under input distribution shift and across datasets. Surprisingly, by simply adding one learned word to the prompts, APT can significantly boost the accuracy and robustness (epsilon=4/255) over the hand-engineered prompts by +13% and +8.5% on average respectively. The improvement further increases, in our most effective setting, to +26.4% for accuracy and +16.7% for robustness. Code is available at this https URL .
- [604] arXiv:2403.01851 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rethinking LLM Language Adaptation: A Case Study on Chinese MixtralComments: 13 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Mixtral, a representative sparse mixture of experts (SMoE) language model, has received significant attention due to its unique model design and superior performance. Based on Mixtral-8x7B-v0.1, in this paper, we propose Chinese-Mixtral and Chinese-Mixtral-Instruct with improved Chinese language abilities by adopting further pre-training and instruction fine-tuning. Experimental results show that our Chinese-Mixtral and Chinese-Mixtral-Instruct successfully improve Chinese understanding and generation performance while retaining the original English abilities. Then, we discuss several key questions when performing language adaptation on large language models, including the necessity of extending the language-specific vocabulary and the choice of the initialization model (foundation model v.s. instruction model), by providing empirical results and analysis. We also present the visualizations of each expert to examine their importance on downstream tasks. Our resources are publicly available through \url{ this https URL }.
- [605] arXiv:2403.01861 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: AiSDF: Structure-aware Neural Signed Distance Fields in Indoor ScenesComments: 8 pages, 6 figures, Accepted to IEEE RA-L (First two authors contributed equally)Journal-ref: IEEE Robotics and Automation Letters (RA-L), vol. 9, no. 5, pp. 4106-4113, 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Indoor scenes we are living in are visually homogenous or textureless, while they inherently have structural forms and provide enough structural priors for 3D scene reconstruction. Motivated by this fact, we propose a structure-aware online signed distance fields (SDF) reconstruction framework in indoor scenes, especially under the Atlanta world (AW) assumption. Thus, we dub this incremental SDF reconstruction for AW as AiSDF. Within the online framework, we infer the underlying Atlanta structure of a given scene and then estimate planar surfel regions supporting the Atlanta structure. This Atlanta-aware surfel representation provides an explicit planar map for a given scene. In addition, based on these Atlanta planar surfel regions, we adaptively sample and constrain the structural regularity in the SDF reconstruction, which enables us to improve the reconstruction quality by maintaining a high-level structure while enhancing the details of a given scene. We evaluate the proposed AiSDF on the ScanNet and ReplicaCAD datasets, where we demonstrate that the proposed framework is capable of reconstructing fine details of objects implicitly, as well as structures explicitly in room-scale scenes.
- [606] arXiv:2403.01875 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ICLN: Input Convex Loss Network for Decision Focused LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In decision-making problem under uncertainty, predicting unknown parameters is often considered independent of the optimization part. Decision-focused Learning (DFL) is a task-oriented framework to integrate prediction and optimization by adapting predictive model to give better decision for the corresponding task. Here, an inevitable challenge arises when computing gradients of the optimal decision with respect to the parameters. Existing researches cope this issue by smoothly reforming surrogate optimization or construct surrogate loss function that mimic task loss. However, they are applied to restricted optimization domain or build functions in a local manner leading a large computational time. In this paper, we propose Input Convex Loss Network (ICLN), a novel global surrogate loss which can be implemented in a general DFL paradigm. ICLN learns task loss via Input Convex Neural Networks which is guaranteed to be convex for some inputs, while keeping the global structure for the other inputs. This enables ICLN to admit general DFL through only a single surrogate loss without any sense for choosing appropriate parametric forms. We confirm effectiveness and flexibility of ICLN by evaluating our proposed model with three stochastic decision-making problems.
- [607] arXiv:2403.01886 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FCDS: Fusing Constituency and Dependency Syntax into Document-Level Relation ExtractionComments: Appear in COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Document-level Relation Extraction (DocRE) aims to identify relation labels between entities within a single document. It requires handling several sentences and reasoning over them. State-of-the-art DocRE methods use a graph structure to connect entities across the document to capture dependency syntax information. However, this is insufficient to fully exploit the rich syntax information in the document. In this work, we propose to fuse constituency and dependency syntax into DocRE. It uses constituency syntax to aggregate the whole sentence information and select the instructive sentences for the pairs of targets. It exploits the dependency syntax in a graph structure with constituency syntax enhancement and chooses the path between entity pairs based on the dependency graph. The experimental results on datasets from various domains demonstrate the effectiveness of the proposed method. The code is publicly available at this url.
- [608] arXiv:2403.01895 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unsupervised Distance Metric Learning for Anomaly Detection Over Multivariate Time SeriesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Distance-based time series anomaly detection methods are prevalent due to their relative non-parametric nature and interpretability. However, the commonly used Euclidean distance is sensitive to noise. While existing works have explored dynamic time warping (DTW) for its robustness, they only support supervised tasks over multivariate time series (MTS), leaving a scarcity of unsupervised methods. In this work, we propose FCM-wDTW, an unsupervised distance metric learning method for anomaly detection over MTS, which encodes raw data into latent space and reveals normal dimension relationships through cluster centers. FCM-wDTW introduces locally weighted DTW into fuzzy C-means clustering and learns the optimal latent space efficiently, enabling anomaly identification via data reconstruction. Experiments with 11 different types of benchmarks demonstrate our method's competitive accuracy and efficiency.
- [609] arXiv:2403.01909 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Semi-Supervised Semantic Segmentation Based on Pseudo-Labels: A SurveySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Semantic segmentation is an important and popular research area in computer vision that focuses on classifying pixels in an image based on their semantics. However, supervised deep learning requires large amounts of data to train models and the process of labeling images pixel by pixel is time-consuming and laborious. This review aims to provide a first comprehensive and organized overview of the state-of-the-art research results on pseudo-label methods in the field of semi-supervised semantic segmentation, which we categorize from different perspectives and present specific methods for specific application areas. In addition, we explore the application of pseudo-label technology in medical and remote-sensing image segmentation. Finally, we also propose some feasible future research directions to address the existing challenges.
- [610] arXiv:2403.01915 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: xT: Nested Tokenization for Larger Context in Large ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to make the uncomfortable choice of which information to discard. We introduce xT, a simple framework for vision transformers which effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model's ability to understand truly large images and incorporate fine details over large scales and assess our method's improvement on them. By introducing a nested tokenization scheme for large images in conjunction with long-sequence length models normally used for natural language processing, we are able to increase accuracy by up to 8.6% on challenging classification tasks and $F_1$ score by 11.6 on context-dependent segmentation in large images.
- [611] arXiv:2403.01924 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question AnsweringSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Medical open-domain question answering demands substantial access to specialized knowledge. Recent efforts have sought to decouple knowledge from model parameters, counteracting architectural scaling and allowing for training on common low-resource hardware. The retrieve-then-read paradigm has become ubiquitous, with model predictions grounded on relevant knowledge pieces from external repositories such as PubMed, textbooks, and UMLS. An alternative path, still under-explored but made possible by the advent of domain-specific large language models, entails constructing artificial contexts through prompting. As a result, "to generate or to retrieve" is the modern equivalent of Hamlet's dilemma. This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU, incorporating a practical perspective by assuming a maximum of 24GB VRAM. MedGENIE sets a new state-of-the-art (SOTA) in the open-book setting of each testbed, even allowing a small-scale reader to outcompete zero-shot closed-book 175B baselines while using up to 706$\times$ fewer parameters. Overall, our findings reveal that generated passages are more effective than retrieved counterparts in attaining higher accuracy.
- [612] arXiv:2403.01954 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: DECIDER: A Rule-Controllable Decoding Strategy for Language Generation by Imitating Dual-System Cognitive TheoryChen Xu , Tian Lan , Changlong Yu , Wei Wang , Jun Gao , Yu Ji , Qunxi Dong , Kun Qian , Piji Li , Wei Bi , Bin HuComments: Submitted to IEEE TKDE (Major Revision), 12 pages, 6 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Abstract: Lexicon-based constrained decoding approaches aim to control the meaning or style of the generated text through certain target concepts. Existing approaches over-focus the targets themselves, leading to a lack of high-level reasoning about how to achieve them. However, human usually tackles tasks by following certain rules that not only focuses on the targets but also on semantically relevant concepts that induce the occurrence of targets. In this work, we present DECIDER, a rule-controllable decoding strategy for constrained language generation inspired by dual-system cognitive theory. Specifically, in DECIDER, a pre-trained language model (PLM) is equiped with a logic reasoner that takes high-level rules as input. Then, the DECIDER allows rule signals to flow into the PLM at each decoding step. Extensive experimental results demonstrate that DECIDER can effectively follow given rules to guide generation direction toward the targets in a more human-like manner.
- [613] arXiv:2403.01964 (cross-list from econ.GN) [ pdf , ps , html , other ]
-
Title: The Heterogeneous Productivity Effects of Generative AISubjects: General Economics (econ.GN) ; Artificial Intelligence (cs.AI)
Abstract: We analyse the individual productivity effects of Italy's ban on ChatGPT, a generative pretrained transformer chatbot. We compile data on the daily coding output quantity and quality of over 36,000 GitHub users in Italy and other European countries and combine these data with the sudden announcement of the ban in a difference-in-differences framework. Among the affected users in Italy, we find a short-term increase in output quantity and quality for less experienced users and a decrease in productivity on more routine tasks for experienced users.
- [614] arXiv:2403.01977 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: TTA-Nav: Test-time Adaptive Reconstruction for Point-Goal Navigation under Visual CorruptionsComments: Submitted to IROS2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Robot navigation under visual corruption presents a formidable challenge. To address this, we propose a Test-time Adaptation (TTA) method, named as TTA-Nav, for point-goal navigation under visual corruptions. Our "plug-and-play" method incorporates a top-down decoder to a pre-trained navigation model. Firstly, the pre-trained navigation model gets a corrupted image and extracts features. Secondly, the top-down decoder produces the reconstruction given the high-level features extracted by the pre-trained model. Then, it feeds the reconstruction of a corrupted image back to the pre-trained model. Finally, the pre-trained model does forward pass again to output action. Despite being trained solely on clean images, the top-down decoder can reconstruct cleaner images from corrupted ones without the need for gradient-based adaptation. The pre-trained navigation model with our top-down decoder significantly enhances navigation performance across almost all visual corruptions in our benchmarks. Our method improves the success rate of point-goal navigation from the state-of-the-art result of 46% to 94% on the most severe corruption. This suggests its potential for broader application in robotic visual navigation. Project page: this https URL
- [615] arXiv:2403.01985 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Transformers for Low-Resource Languages:Is F\'eidir Linn!Comments: 13 pagesJournal-ref: Proceedings of Machine Translation Summit XVIII: Research Track 2021Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The Transformer model is the state-of-the-art in Machine Translation. However, in general, neural translation models often under perform on language pairs with insufficient training data. As a consequence, relatively few experiments have been carried out using this architecture on low-resource language pairs. In this study, hyperparameter optimization of Transformer models in translating the low-resource English-Irish language pair is evaluated. We demonstrate that choosing appropriate parameters leads to considerable performance improvements. Most importantly, the correct choice of subword model is shown to be the biggest driver of translation performance. SentencePiece models using both unigram and BPE approaches were appraised. Variations on model architectures included modifying the number of layers, testing various regularisation techniques and evaluating the optimal number of heads for attention. A generic 55k DGT corpus and an in-domain 88k public admin corpus were used for evaluation. A Transformer optimized model demonstrated a BLEU score improvement of 7.8 points when compared with a baseline RNN model. Improvements were observed across a range of metrics, including TER, indicating a substantially reduced post editing effort for Transformer optimized models with 16k BPE subword models. Bench-marked against Google Translate, our translation engines demonstrated significant improvements. The question of whether or not Transformers can be used effectively in a low-resource setting of English-Irish translation has been addressed. Is féidir linn - yes we can.
- [616] arXiv:2403.02014 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Unveiling Hidden Links Between Unseen Security EntitiesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The proliferation of software vulnerabilities poses a significant challenge for security databases and analysts tasked with their timely identification, classification, and remediation. With the National Vulnerability Database (NVD) reporting an ever-increasing number of vulnerabilities, the traditional manual analysis becomes untenably time-consuming and prone to errors. This paper introduces VulnScopper, an innovative approach that utilizes multi-modal representation learning, combining Knowledge Graphs (KG) and Natural Language Processing (NLP), to automate and enhance the analysis of software vulnerabilities. Leveraging ULTRA, a knowledge graph foundation model, combined with a Large Language Model (LLM), VulnScopper effectively handles unseen entities, overcoming the limitations of previous KG approaches. We evaluate VulnScopper on two major security datasets, the NVD and the Red Hat CVE database. Our method significantly improves the link prediction accuracy between Common Vulnerabilities and Exposures (CVEs), Common Weakness Enumeration (CWEs), and Common Platform Enumerations (CPEs). Our results show that VulnScopper outperforms existing methods, achieving up to 78% Hits@10 accuracy in linking CVEs to CPEs and CWEs and presenting an 11.7% improvement over large language models in predicting CWE labels based on the Red Hat database. Based on the NVD, only 6.37% of the linked CPEs are being published during the first 30 days; many of them are related to critical and high-risk vulnerabilities which, according to multiple compliance frameworks (such as CISA and PCI), should be remediated within 15-30 days. Our model can uncover new products linked to vulnerabilities, reducing remediation time and improving vulnerability management. We analyzed several CVEs from 2023 to showcase this ability.
- [617] arXiv:2403.02018 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Cross Domain Policy Transfer with Effect Cycle-ConsistencyComments: Accepted to International Conference on Robotics and Automation (ICRA), 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Training a robotic policy from scratch using deep reinforcement learning methods can be prohibitively expensive due to sample inefficiency. To address this challenge, transferring policies trained in the source domain to the target domain becomes an attractive paradigm. Previous research has typically focused on domains with similar state and action spaces but differing in other aspects. In this paper, our primary focus lies in domains with different state and action spaces, which has broader practical implications, i.e. transfer the policy from robot A to robot B. Unlike prior methods that rely on paired data, we propose a novel approach for learning the mapping functions between state and action spaces across domains using unpaired data. We propose effect cycle consistency, which aligns the effects of transitions across two domains through a symmetrical optimization structure for learning these mapping functions. Once the mapping functions are learned, we can seamlessly transfer the policy from the source domain to the target domain. Our approach has been tested on three locomotion tasks and two robotic manipulation tasks. The empirical results demonstrate that our method can reduce alignment errors significantly and achieve better performance compared to the state-of-the-art method.
- [618] arXiv:2403.02074 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Modality-Aware and Shift Mixer for Multi-modal Brain Tumor SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Combining images from multi-modalities is beneficial to explore various information in computer vision, especially in the medical domain. As an essential part of clinical diagnosis, multi-modal brain tumor segmentation aims to delineate the malignant entity involving multiple modalities. Although existing methods have shown remarkable performance in the task, the information exchange for cross-scale and high-level representations fusion in spatial and modality are limited in these methods. In this paper, we present a novel Modality Aware and Shift Mixer that integrates intra-modality and inter-modality dependencies of multi-modal images for effective and robust brain tumor segmentation. Specifically, we introduce a Modality-Aware module according to neuroimaging studies for modeling the specific modality pair relationships at low levels, and a Modality-Shift module with specific mosaic patterns is developed to explore the complex relationships across modalities at high levels via the self-attention. Experimentally, we outperform previous state-of-the-art approaches on the public Brain Tumor Segmentation (BraTS 2021 segmentation) dataset. Further qualitative experiments demonstrate the efficacy and robustness of MASM.
- [619] arXiv:2403.02076 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPTComments: 15 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves competitive performance comparable to supervised methods. The code is available on this https URL
- [620] arXiv:2403.02107 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Iterated $Q$-Network: Beyond the One-Step Bellman OperatorComments: PreprintSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Value-based Reinforcement Learning (RL) methods rely on the application of the Bellman operator, which needs to be approximated from samples. Most approaches consist of an iterative scheme alternating the application of the Bellman operator and a subsequent projection step onto a considered function space. However, we observe that these algorithms can be improved by considering multiple iterations of the Bellman operator at once. Thus, we introduce iterated $Q$-Networks (iQN), a novel approach that learns a sequence of $Q$-function approximations where each $Q$-function serves as the target for the next one in a chain of consecutive Bellman iterations. We demonstrate that iQN is theoretically sound and show how it can be seamlessly used in value-based and actor-critic methods. We empirically demonstrate its advantages on Atari $2600$ games and in continuous-control MuJoCo environments.
- [621] arXiv:2403.02118 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Towards Implicit Prompt For Text-To-Image ModelsYue Yang , Yuqi lin , Hong Liu , Wenqi Shao , Runjian Chen , Hailong Shang , Yu Wang , Yu Qiao , Kaipeng Zhang , Ping LuoSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position paper highlights the current state of T2I models toward implicit prompts. We present a benchmark named ImplicitBench and conduct an investigation on the performance and impacts of implicit prompts with popular T2I models. Specifically, we design and collect more than 2,000 implicit prompts of three aspects: General Symbols, Celebrity Privacy, and Not-Safe-For-Work (NSFW) Issues, and evaluate six well-known T2I models' capabilities under these implicit prompts. Experiment results show that (1) T2I models are able to accurately create various target symbols indicated by implicit prompts; (2) Implicit prompts bring potential risks of privacy leakage for T2I models. (3) Constraints of NSFW in most of the evaluated T2I models can be bypassed with implicit prompts. We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.
- [622] arXiv:2403.02121 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language ModelsSargam Yadav (1), Abhishek Kaushik (1), Kevin McDaid (1) ((1) Dundalk Institute of Technology, Dundalk)Comments: This paper is accepted in the 16th ISDSI-Global Conference 2023 this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The advent of Large Language Models (LLMs) has advanced the benchmark in various Natural Language Processing (NLP) tasks. However, large amounts of labelled training data are required to train LLMs. Furthermore, data annotation and training are computationally expensive and time-consuming. Zero and few-shot learning have recently emerged as viable options for labelling data using large pre-trained models. Hate speech detection in mix-code low-resource languages is an active problem area where the use of LLMs has proven beneficial. In this study, we have compiled a dataset of 100 YouTube comments, and weakly labelled them for coarse and fine-grained misogyny classification in mix-code Hinglish. Weak annotation was applied due to the labor-intensive annotation process. Zero-shot learning, one-shot learning, and few-shot learning and prompting approaches have then been applied to assign labels to the comments and compare them to human-assigned labels. Out of all the approaches, zero-shot classification using the Bidirectional Auto-Regressive Transformers (BART) large model and few-shot prompting using Generative Pre-trained Transformer- 3 (ChatGPT-3) achieve the best results
- [623] arXiv:2403.02127 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LOCR: Location-Guided Transformer for Optical Character RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) this http URL tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.LOCR also reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset, from 13.2% to 1.3% in OOD quantum physics documents and from 8.1% to 1.8% in OOD marketing documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from human.
- [624] arXiv:2403.02131 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Deep Reinforcement Learning for Dynamic Algorithm Selection: A Proof-of-Principle Study on Differential EvolutionHongshu Guo , Yining Ma , Zeyuan Ma , Jiacheng Chen , Xinglin Zhang , Zhiguang Cao , Jun Zhang , Yue-Jiao GongComments: Accepted by IEEE Transactions on Systems, Man, and Cybernetics: Systems at Thu, Feb 29, 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Evolutionary algorithms, such as Differential Evolution, excel in solving real-parameter optimization challenges. However, the effectiveness of a single algorithm varies across different problem instances, necessitating considerable efforts in algorithm selection or configuration. This paper aims to address the limitation by leveraging the complementary strengths of a group of algorithms and dynamically scheduling them throughout the optimization progress for specific problems. We propose a deep reinforcement learning-based dynamic algorithm selection framework to accomplish this task. Our approach models the dynamic algorithm selection a Markov Decision Process, training an agent in a policy gradient manner to select the most suitable algorithm according to the features observed during the optimization process. To empower the agent with the necessary information, our framework incorporates a thoughtful design of landscape and algorithmic features. Meanwhile, we employ a sophisticated deep neural network model to infer the optimal action, ensuring informed algorithm selections. Additionally, an algorithm context restoration mechanism is embedded to facilitate smooth switching among different algorithms. These mechanisms together enable our framework to seamlessly select and switch algorithms in a dynamic online fashion. Notably, the proposed framework is simple and generic, offering potential improvements across a broad spectrum of evolutionary algorithms. As a proof-of-principle study, we apply this framework to a group of Differential Evolution algorithms. The experimental results showcase the remarkable effectiveness of the proposed framework, not only enhancing the overall optimization performance but also demonstrating favorable generalization ability across different problem classes.
- [625] arXiv:2403.02167 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: Speech emotion recognition from voice messages recorded in the wildLucía Gómez-Zaragozá , Óscar Valls , Rocío del Amor , María José Castro-Bleda , Valery Naranjo , Mariano Alcañiz Raya , Javier Marín-MoralesComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Abstract: Emotion datasets used for Speech Emotion Recognition (SER) often contain acted or elicited speech, limiting their applicability in real-world scenarios. In this work, we used the Emotional Voice Messages (EMOVOME) database, including spontaneous voice messages from conversations of 100 Spanish speakers on a messaging app, labeled in continuous and discrete emotions by expert and non-expert annotators. We created speaker-independent SER models using the eGeMAPS features, transformer-based models and their combination. We compared the results with reference databases and analyzed the influence of annotators and gender fairness. The pre-trained Unispeech-L model and its combination with eGeMAPS achieved the highest results, with 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in the prediction of emotion categories, while similar results were obtained in valence and arousal. Additionally, EMOVOME outcomes varied with annotator labels, showing superior results and better fairness when combining expert and non-expert annotations. This study significantly contributes to the evaluation of SER models in real-life situations, advancing in the development of applications for analyzing spontaneous voice messages.
- [626] arXiv:2403.02178 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language ModelsChangyu Chen , Xiting Wang , Ting-En Lin , Ang Lv , Yuchuan Wu , Xin Gao , Ji-Rong Wen , Rui Yan , Yongbin LiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In reasoning tasks, even a minor error can cascade into inaccurate results, leading to suboptimal performance of large language models in such domains. Earlier fine-tuning approaches sought to mitigate this by leveraging more precise supervisory signals from human labeling, larger models, or self-sampling, although at a high cost. Conversely, we develop a method that avoids external resources, relying instead on introducing perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks. When applied to fine-tuning with GSM8K, this method achieved a 5% improvement in accuracy over standard supervised fine-tuning with a few codes modified and no additional labeling effort. Furthermore, it is complementary to existing methods. When integrated with related data augmentation methods, it leads to an average improvement of 3% improvement in GSM8K accuracy and 1% improvement in MATH accuracy across five datasets of various quality and size, as well as two base models. We further investigate the mechanisms behind this improvement through case studies and quantitative analysis, suggesting that our approach may provide superior support for the model in capturing long-distance dependencies, especially those related to questions. This enhancement could deepen understanding of premises in questions and prior steps. Our code is available at Github.
- [627] arXiv:2403.02181 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Not all Layers of LLMs are Necessary during InferenceSiqi Fan , Xin Jiang , Xiang Li , Xuying Meng , Peng Han , Shuo Shang , Aixin Sun , Yequan Wang , Zhongyuan WangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The inference phase of Large Language Models (LLMs) is very expensive. An ideal inference stage of LLMs could utilize fewer computational resources while still maintaining its capabilities (e.g., generalization and in-context learning ability). In this paper, we try to answer the question, "During LLM inference, can we use shallow layers for easy instances; and deep layers for hard ones?" To answer this question, we first indicate that Not all Layers are Necessary during Inference by statistically analyzing the activated layers across tasks. Then, we propose a simple algorithm named AdaInfer to determine the inference termination moment based on the input instance adaptively. More importantly, AdaInfer does not alter LLM parameters and maintains generalizability across tasks. Experiments on well-known LLMs (i.e., Llama2 series and OPT) show that AdaInfer saves an average of 14.8% of computational resources, even up to 50% on sentiment tasks, while maintaining comparable performance. Additionally, this method is orthogonal to other model acceleration techniques, potentially boosting inference efficiency further.
- [628] arXiv:2403.02227 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Policy Space Response Oracles: A SurveyComments: Ariyan Bighashdel and Yongzhao Wang contributed equallySubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: In game theory, a game refers to a model of interaction among rational decision-makers or players, making choices with the goal of achieving their individual objectives. Understanding their behavior in games is often referred to as game reasoning. This survey provides a comprehensive overview of a fast-developing game-reasoning framework for large games, known as Policy Space Response Oracles (PSRO). We first motivate PSRO, provide historical context, and position PSRO within game-reasoning approaches. We then focus on the strategy exploration issue for PSRO, the challenge of assembling an effective strategy portfolio for modeling the underlying game with minimum computational cost. We also survey current research directions for enhancing the efficiency of PSRO, and explore the applications of PSRO across various domains. We conclude by discussing open questions and future research.
- [629] arXiv:2403.02232 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detectionJournal-ref: International Journal of Computer Science and Information Technology, 2024, 2(1), 1-9Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study conducts a thorough examination of malware detection using machine learning techniques, focusing on the evaluation of various classification models using the Mal-API-2019 dataset. The aim is to advance cybersecurity capabilities by identifying and mitigating threats more effectively. Both ensemble and non-ensemble machine learning methods, such as Random Forest, XGBoost, K Nearest Neighbor (KNN), and Neural Networks, are explored. Special emphasis is placed on the importance of data pre-processing techniques, particularly TF-IDF representation and Principal Component Analysis, in improving model performance. Results indicate that ensemble methods, particularly Random Forest and XGBoost, exhibit superior accuracy, precision, and recall compared to others, highlighting their effectiveness in malware detection. The paper also discusses limitations and potential future directions, emphasizing the need for continuous adaptation to address the evolving nature of malware. This research contributes to ongoing discussions in cybersecurity and provides practical insights for developing more robust malware detection systems in the digital era.
- [630] arXiv:2403.02238 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Towards Intent-Based Network Management: Large Language Models for Intent Extraction in 5G Core NetworksComments: Submitted to: International Conference on the Design of Reliable Communication Networks 2024Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: The integration of Machine Learning and Artificial Intelligence (ML/AI) into fifth-generation (5G) networks has made evident the limitations of network intelligence with ever-increasing, strenuous requirements for current and next-generation devices. This transition to ubiquitous intelligence demands high connectivity, synchronicity, and end-to-end communication between users and network operators, and will pave the way towards full network automation without human intervention. Intent-based networking is a key factor in the reduction of human actions, roles, and responsibilities while shifting towards novel extraction and interpretation of automated network management. This paper presents the development of a custom Large Language Model (LLM) for 5G and next-generation intent-based networking and provides insights into future LLM developments and integrations to realize end-to-end intent-based networking for fully automated network intelligence.
- [631] arXiv:2403.02241 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Neural Redshift: Random Networks are not Random FunctionsJournal-ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD) but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs.
Findings. To understand the inductive biases provided by architectures independently from GD, we examine untrained, random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom, NNs do not have an inherent "simplicity bias". This property depends on components such as ReLUs, residual connections, and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks.
Implications. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models. - [632] arXiv:2403.02243 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Better Schedules for Low Precision Training of Deep Neural NetworksComments: 20 pages, 8 figures, 1 table, ACML 2023Journal-ref: Machine Learning (2024): 1-19Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Low precision training can significantly reduce the computational overhead of training deep neural networks (DNNs). Though many such techniques exist, cyclic precision training (CPT), which dynamically adjusts precision throughout training according to a cyclic schedule, achieves particularly impressive improvements in training efficiency, while actually improving DNN performance. Existing CPT implementations take common learning rate schedules (e.g., cyclical cosine schedules) and use them for low precision training without adequate comparisons to alternative scheduling options. We define a diverse suite of CPT schedules and analyze their performance across a variety of DNN training regimes, some of which are unexplored in the low precision training literature (e.g., node classification with graph neural networks). From these experiments, we discover alternative CPT schedules that offer further improvements in training efficiency and model performance, as well as derive a set of best practices for choosing CPT schedules. Going further, we find that a correlation exists between model performance and training cost, and that changing the underlying CPT schedule can control the tradeoff between these two variables. To explain the direct correlation between model performance and training cost, we draw a connection between quantized training and critical learning periods, suggesting that aggressive quantization is a form of learning impairment that can permanently damage model performance.
- [633] arXiv:2403.02249 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Non-autoregressive Sequence-to-Sequence Vision-Language ModelsComments: Accepted to CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference.
- [634] arXiv:2403.02253 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: KnowPhish: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing Reference-Based Phishing DetectionYuexin Li , Chengyu Huang , Shumin Deng , Mei Lin Lock , Tri Cao , Nay Oo , Bryan Hooi , Hoon Wei LimSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Phishing attacks have inflicted substantial losses on individuals and businesses alike, necessitating the development of robust and efficient automated phishing detection approaches. Reference-based phishing detectors (RBPDs), which compare the logos on a target webpage to a known set of logos, have emerged as the state-of-the-art approach. However, a major limitation of existing RBPDs is that they rely on a manually constructed brand knowledge base, making it infeasible to scale to a large number of brands, which results in false negative errors due to the insufficient brand coverage of the knowledge base. To address this issue, we propose an automated knowledge collection pipeline, using which we collect and release a large-scale multimodal brand knowledge base, KnowPhish, containing 20k brands with rich information about each brand. KnowPhish can be used to boost the performance of existing RBPDs in a plug-and-play manner. A second limitation of existing RBPDs is that they solely rely on the image modality, ignoring useful textual information present in the webpage HTML. To utilize this textual information, we propose a Large Language Model (LLM)-based approach to extract brand information of webpages from text. Our resulting multimodal phishing detection approach, KnowPhish Detector (KPD), can detect phishing webpages with or without logos. We evaluate KnowPhish and KPD on a manually validated dataset, and on a field study under Singapore's local context, showing substantial improvements in effectiveness and efficiency compared to state-of-the-art baselines.
- [635] arXiv:2403.02268 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Subjective $\textit{Isms}$? On the Danger of Conflating Hate and Offence in Abusive Language DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Natural language processing research has begun to embrace the notion of annotator subjectivity, motivated by variations in labelling. This approach understands each annotator's view as valid, which can be highly suitable for tasks that embed subjectivity, e.g., sentiment analysis. However, this construction may be inappropriate for tasks such as hate speech detection, as it affords equal validity to all positions on e.g., sexism or racism. We argue that the conflation of hate and offence can invalidate findings on hate speech, and call for future work to be situated in theory, disentangling hate from its orthogonal concept, offence.
- [636] arXiv:2403.02302 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Multimodal Large Language Models (MLLMs) have recently gained immense popularity. Powerful commercial models like ChatGPT-4V and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models and are applied to solve a wide variety of tasks, including those in computer vision. These neural networks possess such strong general knowledge and reasoning abilities that they have proven capable of working even on tasks for which they were not specifically trained. We compared the capabilities of the most powerful MLLMs to date: ShareGPT4V, ChatGPT, LLaVA-Next in a specialized task of age and gender estimation with our state-of-the-art specialized model, MiVOLO. We also updated MiVOLO and provide details and new metrics in this article. This comparison has yielded some interesting results and insights about the strengths and weaknesses of the participating models. Furthermore, we attempted various ways to fine-tune the ShareGPT4V model for this specific task, aiming to achieve state-of-the-art results in this particular challenge. Although such a model would not be practical in production, as it is incredibly expensive compared to a specialized model like MiVOLO, it could be very useful in some tasks, like data annotation.
- [637] arXiv:2403.02325 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Contrastive Region Guidance: Improving Grounding in Vision-Language Models without TrainingComments: Project website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual guidance are either proprietary and expensive or require costly training on curated data that includes visual prompts. We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. CRG contrasts model outputs produced with and without visual prompts, factoring out biases revealed by the model when answering without the information required to produce a correct answer (i.e., the model's prior). CRG achieves substantial improvements in a wide variety of VL tasks: When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse region-based tasks such as recognition, math, and object relationship reasoning. We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp, as well as to compositional generalization -- improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe -- and to image-text alignment for generated images, where we improve by up to 8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG allows us to re-rank proposed regions in referring expression comprehension and phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an average gain of 3.2% in accuracy. Our analysis explores alternative masking strategies for CRG, quantifies CRG's probability shift, and evaluates the role of region guidance strength, empirically validating CRG's design choices.
- [638] arXiv:2403.02327 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Model LakesSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI)
Abstract: Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of machine learning models increases, this issue of finding, differentiating, and understanding models is becoming more crucial. Inspired from research on data lakes, we introduce and define the concept of model lakes. We discuss fundamental research challenges in the management of large models. And we discuss what principled data management techniques can be brought to bear on the study of large model management.
- [639] arXiv:2403.02333 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Key-Point-Driven Data Synthesis with its Enhancement on Mathematical ReasoningComments: In progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality and reasoning-focused training datasets. Addressing this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% PASS@1 accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 70B range and best commercial models like GPT-4 across multiple math reasoning datasets.
- [640] arXiv:2403.02334 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Gradient Correlation Subspace Learning against Catastrophic ForgettingComments: 5 figures; Code will be available here: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Efficient continual learning techniques have been a topic of significant research over the last few years. A fundamental problem with such learning is severe degradation of performance on previously learned tasks, known also as catastrophic forgetting. This paper introduces a novel method to reduce catastrophic forgetting in the context of incremental class learning called Gradient Correlation Subspace Learning (GCSL). The method detects a subspace of the weights that is least affected by previous tasks and projects the weights to train for the new task into said subspace. The method can be applied to one or more layers of a given network architectures and the size of the subspace used can be altered from layer to layer and task to task. Code will be available at \href{ this https URL }{ this https URL }
- [641] arXiv:2403.02336 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Brand Visibility in Packaging: A Deep Learning Approach for Logo Detection, Saliency-Map Prediction, and Logo Placement AnalysisAlireza Hosseini , Kiana Hooshanfar , Pouria Omrani , Reza Toosi , Ramin Toosi , Zahra Ebrahimian , Mohammad Ali AkhaeeSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the highly competitive area of product marketing, the visibility of brand logos on packaging plays a crucial role in shaping consumer perception, directly influencing the success of the product. This paper introduces a comprehensive framework to measure the brand logo's attention on a packaging design. The proposed method consists of three steps. The first step leverages YOLOv8 for precise logo detection across prominent datasets, FoodLogoDet-1500 and LogoDet-3K. The second step involves modeling the user's visual attention with a novel saliency prediction model tailored for the packaging context. The proposed saliency model combines the visual elements with text maps employing a transformers-based architecture to predict user attention maps. In the third step, by integrating logo detection with a saliency map generation, the framework provides a comprehensive brand attention score. The effectiveness of the proposed method is assessed module by module, ensuring a thorough evaluation of each component. Comparing logo detection and saliency map prediction with state-of-the-art models shows the superiority of the proposed methods. To investigate the robustness of the proposed brand attention score, we collected a unique dataset to examine previous psychophysical hypotheses related to brand visibility. the results show that the brand attention score is in line with all previous studies. Also, we introduced seven new hypotheses to check the impact of position, orientation, presence of person, and other visual elements on brand attention. This research marks a significant stride in the intersection of cognitive psychology, computer vision, and marketing, paving the way for advanced, consumer-centric packaging designs.
- [642] arXiv:2403.02338 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Twisting Lids Off with Two HandsComments: Project page can be found at this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Manipulating objects with two multi-fingered hands has been a long-standing challenge in robotics, attributed to the contact-rich nature of many manipulation tasks and the complexity inherent in coordinating a high-dimensional bimanual system. In this work, we consider the problem of twisting lids of various bottle-like objects with two hands, and demonstrate that policies trained in simulation using deep reinforcement learning can be effectively transferred to the real world. With novel engineering insights into physical modeling, real-time perception, and reward design, the policy demonstrates generalization capabilities across a diverse set of unseen objects, showcasing dynamic and dexterous behaviors. Our findings serve as compelling evidence that deep reinforcement learning combined with sim-to-real transfer remains a promising approach for addressing manipulation problems of unprecedented complexity.
- [643] arXiv:2403.02342 (cross-list from physics.soc-ph) [ pdf , ps , other ]
-
Title: Entanglement: Balancing Punishment and Compensation, Repeated Dilemma Game-Theoretic Analysis of Maximum Compensation Problem for Bypass and Least Cost Paths in Fact-Checking, Case of Fake News with Weak Wallace's LawComments: Recurring Dilemma, Wallace's Law, Entanglement, Detour Path, Least Cost Path, Metzler Function, Metzler Matrix, Fake News, Fact-Checking, Punitive Dominance Problem, Maximum Compensation Problem, Informational healthSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Abstract: This research note is organized with respect to a novel approach to solving problems related to the spread of fake news and effective fact-checking. Focusing on the least-cost routing problem, the discussion is organized with respect to the use of Metzler functions and Metzler matrices to model the dynamics of information propagation among news providers. With this approach, we designed a strategy to minimize the spread of fake news, which is detrimental to informational health, while at the same time maximizing the spread of credible information. In particular, through the punitive dominance problem and the maximum compensation problem, we developed and examined a path to reassess the incentives of news providers to act and to analyze their impact on the equilibrium of the information market. By applying the concept of entanglement to the context of information propagation, we shed light on the complexity of interactions among news providers and contribute to the formulation of more effective information management strategies. This study provides new theoretical and practical insights into issues related to fake news and fact-checking, and will be examined against improving informational health and public digital health.This paper is partially an attempt to utilize "Generative AI" and was written with educational intent. There are currently no plans for it to become a peer-reviewed paper.
- [644] arXiv:2403.02352 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: ATP: Enabling Fast LLM Serving via Attention on Top Principal KeysComments: 10 pages, 7 figures, 8 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We propose a new attention mechanism with linear complexity, ATP, that fixates \textbf{A}ttention on \textbf{T}op \textbf{P}rincipal keys, rather than on each individual token. Particularly, ATP is driven by an important observation that input sequences are typically low-rank, i.e., input sequences can be represented by a few principal bases. Therefore, instead of directly iterating over all the input tokens, ATP transforms inputs into an orthogonal space and computes attention only on the top principal bases (keys). Owing to the observed low-rank structure in input sequences, ATP is able to capture semantic relationships in input sequences with a few principal keys. Furthermore, the attention complexity is reduced from \emph{quadratic} to \emph{linear} without incurring a noticeable performance drop. ATP further reduces complexity for other linear layers with low-rank inputs, leading to more speedup compared to prior works that solely target the attention module. Our evaluations on various models (e.g., BERT and Llama) demonstrate that ATP achieves comparable accuracy with much lower computation and memory complexity than the standard attention mechanism. In particular, ATP barely loses accuracy with only $1/2$ principal keys, and only incurs around $2\%$ accuracy drops with $1/4$ principal keys.
- [645] arXiv:2403.02354 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Spatio-Temporal Field Neural Networks for Air Quality InferenceYutong Feng , Qiongyan Wang , Yutong Xia , Junlin Huang , Siru Zhong , Kun Wang , Shifen Cheng , Yuxuan LiangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The air quality inference problem aims to utilize historical data from a limited number of observation sites to infer the air quality index at an unknown location. Considering the sparsity of data due to the high maintenance cost of the stations, good inference algorithms can effectively save the cost and refine the data granularity. While spatio-temporal graph neural networks have made excellent progress on this problem, their non-Euclidean and discrete data structure modeling of reality limits its potential. In this work, we make the first attempt to combine two different spatio-temporal perspectives, fields and graphs, by proposing a new model, Spatio-Temporal Field Neural Network, and its corresponding new framework, Pyramidal Inference. Extensive experiments validate that our model achieves state-of-the-art performance in nationwide air quality inference in the Chinese Mainland, demonstrating the superiority of our proposed model and framework.
- [646] arXiv:2403.02355 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Temporal Knowledge Graph Completion with Time-sensitive Relations in Hypercomplex SpaceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Temporal knowledge graph completion (TKGC) aims to fill in missing facts within a given temporal knowledge graph at a specific time. Existing methods, operating in real or complex spaces, have demonstrated promising performance in this task. This paper advances beyond conventional approaches by introducing more expressive quaternion representations for TKGC within hypercomplex space. Unlike existing quaternion-based methods, our study focuses on capturing time-sensitive relations rather than time-aware entities. Specifically, we model time-sensitive relations through time-aware rotation and periodic time translation, effectively capturing complex temporal variability. Furthermore, we theoretically demonstrate our method's capability to model symmetric, asymmetric, inverse, compositional, and evolutionary relation patterns. Comprehensive experiments on public datasets validate that our proposed approach achieves state-of-the-art performance in the field of TKGC.
- [647] arXiv:2403.02360 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Optimal Customized Architecture for Heterogeneous Federated Learning with Contrastive Cloud-Edge Model DecouplingXingyan Chen , Tian Du , Mu Wang , Tiancheng Gu , Yu Zhao , Gang Kou , Changqiao Xu , Dapeng Oliver WuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning, as a promising distributed learning paradigm, enables collaborative training of a global model across multiple network edge clients without the need for central data collecting. However, the heterogeneity of edge data distribution drags the model towards the local minima, which can be distant from the global optimum. Such heterogeneity often leads to slow convergence and substantial communication overhead. To address these issues, we propose a novel federated learning framework called FedCMD, a model decoupling tailored to the Cloud-edge supported federated learning that separates deep neural networks into a body for capturing shared representations in Cloud and a personalized head for migrating data heterogeneity. Our motivation is that, by the deep investigation of the performance of selecting different neural network layers as the personalized head, we found rigidly assigning the last layer as the personalized head in current studies is not always optimal. Instead, it is necessary to dynamically select the personalized layer that maximizes the training performance by taking the representation difference between neighbor layers into account. To find the optimal personalized layer, we utilize the low-dimensional representation of each layer to contrast feature distribution transfer and introduce a Wasserstein-based layer selection method, aimed at identifying the best-match layer for personalization. Additionally, a weighted global aggregation algorithm is proposed based on the selected personalized layer for the practical application of FedCMD. Extensive experiments on ten benchmarks demonstrate the efficiency and superior performance of our solution compared with nine state-of-the-art solutions. All code and results are available at this https URL .
- [648] arXiv:2403.02363 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Addressing Long-Tail Noisy Label Learning Problems: a Two-Stage Solution with Label Refurbishment Considering Label RaritySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Real-world datasets commonly exhibit noisy labels and class imbalance, such as long-tailed distributions. While previous research addresses this issue by differentiating noisy and clean samples, reliance on information from predictions based on noisy long-tailed data introduces potential errors. To overcome the limitations of prior works, we introduce an effective two-stage approach by combining soft-label refurbishing with multi-expert ensemble learning. In the first stage of robust soft label refurbishing, we acquire unbiased features through contrastive learning, making preliminary predictions using a classifier trained with a carefully designed BAlanced Noise-tolerant Cross-entropy (BANC) loss. In the second stage, our label refurbishment method is applied to obtain soft labels for multi-expert ensemble learning, providing a principled solution to the long-tail noisy label problem. Experiments conducted across multiple benchmarks validate the superiority of our approach, Label Refurbishment considering Label Rarity (LR^2), achieving remarkable accuracies of 94.19% and 77.05% on simulated noisy CIFAR-10 and CIFAR-100 long-tail datasets, as well as 77.74% and 81.40% on real-noise long-tail datasets, Food-101N and Animal-10N, surpassing existing state-of-the-art methods.
- [649] arXiv:2403.02366 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Human Evaluation of English--Irish Transformer-Based NMTComments: arXiv admin note: text overlap with arXiv:2403.01985Journal-ref: Information 2022, 13(7), 309Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this study, a human evaluation is carried out on how hyperparameter settings impact the quality of Transformer-based Neural Machine Translation (NMT) for the low-resourced English--Irish pair. SentencePiece models using both Byte Pair Encoding (BPE) and unigram approaches were appraised. Variations in model architectures included modifying the number of layers, evaluating the optimal number of heads for attention and testing various regularisation techniques. The greatest performance improvement was recorded for a Transformer-optimized model with a 16k BPE subword model. Compared with a baseline Recurrent Neural Network (RNN) model, a Transformer-optimized model demonstrated a BLEU score improvement of 7.8 points. When benchmarked against Google Translate, our translation engines demonstrated significant improvements. Furthermore, a quantitative fine-grained manual evaluation was conducted which compared the performance of machine translation systems. Using the Multidimensional Quality Metrics (MQM) error taxonomy, a human evaluation of the error types generated by an RNN-based system and a Transformer-based system was explored. Our findings show the best-performing Transformer system significantly reduces both accuracy and fluency errors when compared with an RNN-based model.
- [650] arXiv:2403.02367 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: adaptNMT: an open-source, language-agnostic development environment for Neural Machine TranslationJournal-ref: Language Resources and Evaluation 57, 1671-1696, (2023)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: adaptNMT streamlines all processes involved in the development and deployment of RNN and Transformer neural translation models. As an open-source application, it is designed for both technical and non-technical users who work in the field of machine translation. Built upon the widely-adopted OpenNMT ecosystem, the application is particularly useful for new entrants to the field since the setup of the development environment and creation of train, validation and test splits is greatly simplified. Graphing, embedded within the application, illustrates the progress of model training, and SentencePiece is used for creating subword segmentation models. Hyperparameter customization is facilitated through an intuitive user interface, and a single-click model development approach has been implemented. Models developed by adaptNMT can be evaluated using a range of metrics, and deployed as a translation service within the application. To support eco-friendly research in the NLP space, a green report also flags the power consumption and kgCO$_{2}$ emissions generated during model development. The application is freely available.
- [651] arXiv:2403.02368 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Novel Hybrid Feature Importance and Feature Interaction Detection Framework for Predictive Optimization in Industry 4.0 ApplicationsJournal-ref: IECON 2023- 49th Annual Conference of the IEEE Industrial Electronics SocietySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Advanced machine learning algorithms are increasingly utilized to provide data-based prediction and decision-making support in Industry 4.0. However, the prediction accuracy achieved by the existing models is insufficient to warrant practical implementation in real-world applications. This is because not all features present in real-world datasets possess a direct relevance to the predictive analysis being conducted. Consequently, the careful incorporation of select features has the potential to yield a substantial positive impact on the outcome. To address the research gap, this paper proposes a novel hybrid framework that combines the feature importance detector - local interpretable model-agnostic explanations (LIME) and the feature interaction detector - neural interaction detection (NID), to improve prediction accuracy. By applying the proposed framework, unnecessary features can be eliminated, and interactions are encoded to generate a more conducive dataset for predictive purposes. Subsequently, the proposed model is deployed to refine the prediction of electricity consumption in foundry processing. The experimental outcomes reveal an augmentation of up to 9.56% in the R2 score, and a diminution of up to 24.05% in the root mean square error.
- [652] arXiv:2403.02370 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM PlaygroundsJournal-ref: Information 2023, 14(12), 638Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The advent of Multilingual Language Models (MLLMs) and Large Language Models has spawned innovation in many areas of natural language processing. Despite the exciting potential of this technology, its impact on developing high-quality Machine Translation (MT) outputs for low-resource languages remains relatively under-explored. Furthermore, an open-source application, dedicated to both fine-tuning MLLMs and managing the complete MT workflow for low-resources languages, remains unavailable. We aim to address these imbalances through the development of adaptMLLM, which streamlines all processes involved in the fine-tuning of MLLMs for MT. This open-source application is tailored for developers, translators, and users who are engaged in MT. An intuitive interface allows for easy customisation of hyperparameters, and the application offers a range of metrics for model evaluation and the capability to deploy models as a translation service directly within the application. As a multilingual tool, we used adaptMLLM to fine-tune models for two low-resource language pairs: English to Irish (EN$\leftrightarrow$GA) and English to Marathi (EN$\leftrightarrow$MR). Compared with baselines from the LoResMT2021 Shared Task, the adaptMLLM system demonstrated significant improvements. In the EN$\rightarrow$GA direction, an improvement of 5.2 BLEU points was observed and an increase of 40.5 BLEU points was recorded in the GA$\rightarrow$EN direction. Significant improvements in the translation performance of the EN$\leftrightarrow$MR pair were also observed notably in the MR$\rightarrow$EN direction with an increase of 21.3 BLEU points. Finally, a fine-grained human evaluation of the MLLM output on the EN$\rightarrow$GA pair was conducted using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The application and models are freely available.
- [653] arXiv:2403.02371 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: NeuroVoz: a Castillian Spanish corpus of parkinsonian speechJanaína Mendes-Laureano , Jorge A. Gómez-García , Alejandro Guerrero-López , Elisa Luque-Buzo , Julián D. Arias-Londoño , Francisco J. Grandas-Pérez , Juan I. Godino-LlorenteComments: Preprint versionSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Abstract: The advancement of Parkinson's Disease (PD) diagnosis through speech analysis is hindered by a notable lack of publicly available, diverse language datasets, limiting the reproducibility and further exploration of existing research.
In response to this gap, we introduce a comprehensive corpus from 108 native Castilian Spanish speakers, comprising 55 healthy controls and 53 individuals diagnosed with PD, all of whom were under pharmacological treatment and recorded in their medication-optimized state. This unique dataset features a wide array of speech tasks, including sustained phonation of the five Spanish vowels, diadochokinetic tests, 16 listen-and-repeat utterances, and free monologues. The dataset emphasizes accuracy and reliability through specialist manual transcriptions of the listen-and-repeat tasks and utilizes Whisper for automated monologue transcriptions, making it the most complete public corpus of Parkinsonian speech, and the first in Castillian Spanish.
NeuroVoz is composed by 2,903 audio recordings averaging $26.88 \pm 3.35$ recordings per participant, offering a substantial resource for the scientific exploration of PD's impact on speech. This dataset has already underpinned several studies, achieving a benchmark accuracy of 89% in PD speech pattern identification, indicating marked speech alterations attributable to PD. Despite these advances, the broader challenge of conducting a language-agnostic, cross-corpora analysis of Parkinsonian speech patterns remains an open area for future research. This contribution not only fills a critical void in PD speech analysis resources but also sets a new standard for the global research community in leveraging speech as a diagnostic tool for neurodegenerative diseases. - [654] arXiv:2403.02372 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: OTClean: Data Cleaning for Conditional Independence Violations using Optimal TransportSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce \sys, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.
- [655] arXiv:2403.02419 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference SystemsLingjiao Chen , Jared Quincy Davis , Boris Hanin , Peter Bailis , Ion Stoica , Matei Zaharia , James ZouSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Large Language Model (LLM) calls and aggregate their responses. However, there is little understanding of how the number of LLM calls -- e.g., when asking the LLM to answer each question multiple times and taking a consensus -- affects such a compound system's performance. In this paper, we initiate the study of scaling laws of compound inference systems. We analyze, theoretically and empirically, how the number of LLM calls affects the performance of one-layer Voting Inference Systems -- one of the simplest compound systems, which aggregates LLM responses via majority voting. We find empirically that across multiple language tasks, surprisingly, Voting Inference Systems' performance first increases but then decreases as a function of the number of LLM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LLM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior emerges when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LLM calls that maximizes system performance, and define a scaling law of Voting Inference Systems. Experiments show that our scaling law can predict the performance of Voting Inference Systems and find the optimal number of LLM calls to make.
- [656] arXiv:2403.02429 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards efficient deep autoencoders for multivariate time series anomaly detectionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Multivariate time series anomaly detection is a crucial problem in many industrial and research applications. Timely detection of anomalies allows, for instance, to prevent defects in manufacturing processes and failures in cyberphysical systems. Deep learning methods are preferred among others for their accuracy and robustness for the analysis of complex multivariate data. However, a key aspect is being able to extract predictions in a timely manner, to accommodate real-time requirements in different applications. In the case of deep learning models, model reduction is extremely important to achieve optimal results in real-time systems with limited time and memory constraints. In this paper, we address this issue by proposing a novel compression method for deep autoencoders that involves three key factors. First, pruning reduces the number of weights, while preventing catastrophic drops in accuracy by means of a fast search process that identifies high sparsity levels. Second, linear and non-linear quantization reduces model complexity by reducing the number of bits for every single weight. The combined contribution of these three aspects allow the model size to be reduced, by removing a subset of the weights (pruning), and decreasing their bit-width (quantization). As a result, the compressed model is faster and easier to adopt in highly constrained hardware environments. Experiments performed on popular multivariate anomaly detection benchmarks, show that our method is capable of achieving significant model compression ratio (between 80% and 95%) without a significant reduction in the anomaly detection performance.
- [657] arXiv:2403.02437 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: SoK: Challenges and Opportunities in Federated UnlearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated learning (FL), introduced in 2017, facilitates collaborative learning between non-trusting parties with no need for the parties to explicitly share their data among themselves. This allows training models on user data while respecting privacy regulations such as GDPR and CPRA. However, emerging privacy requirements may mandate model owners to be able to \emph{forget} some learned data, e.g., when requested by data owners or law enforcement. This has given birth to an active field of research called \emph{machine unlearning}. In the context of FL, many techniques developed for unlearning in centralized settings are not trivially applicable! This is due to the unique differences between centralized and distributed learning, in particular, interactivity, stochasticity, heterogeneity, and limited accessibility in FL. In response, a recent line of work has focused on developing unlearning mechanisms tailored to FL.
This SoK paper aims to take a deep look at the \emph{federated unlearning} literature, with the goal of identifying research trends and challenges in this emerging field. By carefully categorizing papers published on FL unlearning (since 2020), we aim to pinpoint the unique complexities of federated unlearning, highlighting limitations on directly applying centralized unlearning methods. We compare existing federated unlearning methods regarding influence removal and performance recovery, compare their threat models and assumptions, and discuss their implications and limitations. For instance, we analyze the experimental setup of FL unlearning studies from various perspectives, including data heterogeneity and its simulation, the datasets used for demonstration, and evaluation metrics. Our work aims to offer insights and suggestions for future research on federated unlearning. - [658] arXiv:2403.02439 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Root Causing Prediction Anomalies Using Explainable AIComments: Submitted to The 2nd World Conference on eXplainable Artificial Intelligence, 17-19 July, 2024, Malta, VallettaSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a novel application of explainable AI (XAI) for root-causing performance degradation in machine learning models that learn continuously from user engagement data. In such systems a single feature corruption can cause cascading feature, label and concept drifts. We have successfully applied this technique to improve the reliability of models used in personalized advertising. Performance degradation in such systems manifest as prediction anomalies in the models. These models are typically trained continuously using features that are produced by hundreds of real time data processing pipelines or derived from other upstream models. A failure in any of these pipelines or an instability in any of the upstream models can cause feature corruption, causing the model's predicted output to deviate from the actual output and the training data to become corrupted. The causal relationship between the features and the predicted output is complex, and root-causing is challenging due to the scale and dynamism of the system. We demonstrate how temporal shifts in the global feature importance distribution can effectively isolate the cause of a prediction anomaly, with better recall than model-to-feature correlation methods. The technique appears to be effective even when approximating the local feature importance using a simple perturbation-based method, and aggregating over a few thousand examples. We have found this technique to be a model-agnostic, cheap and effective way to monitor complex data pipelines in production and have deployed a system for continuously analyzing the global feature importance distribution of continuously trained models.
- [659] arXiv:2403.02444 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Anatomically Constrained Tractography of the Fetal BrainCamilo Calixto , Camilo Jaimes , Matheus D. Soldatelli , Simon K. Warfield , Ali Gholipour , Davood KarimiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Diffusion-weighted Magnetic Resonance Imaging (dMRI) is increasingly used to study the fetal brain in utero. An important computation enabled by dMRI is streamline tractography, which has unique applications such as tract-specific analysis of the brain white matter and structural connectivity assessment. However, due to the low fetal dMRI data quality and the challenging nature of tractography, existing methods tend to produce highly inaccurate results. They generate many false streamlines while failing to reconstruct streamlines that constitute the major white matter tracts. In this paper, we advocate for anatomically constrained tractography based on an accurate segmentation of the fetal brain tissue directly in the dMRI space. We develop a deep learning method to compute the segmentation automatically. Experiments on independent test data show that this method can accurately segment the fetal brain tissue and drastically improve tractography results. It enables the reconstruction of highly curved tracts such as optic radiations. Importantly, our method infers the tissue segmentation and streamline propagation direction from a diffusion tensor fit to the dMRI data, making it applicable to routine fetal dMRI scans. The proposed method can lead to significant improvements in the accuracy and reproducibility of quantitative assessment of the fetal brain with dMRI.
- [660] arXiv:2403.02484 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Encodings for Prediction-based Neural Architecture SearchSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Abstract: Predictor-based methods have substantially enhanced Neural Architecture Search (NAS) optimization. The efficacy of these predictors is largely influenced by the method of encoding neural network architectures. While traditional encodings used an adjacency matrix describing the graph structure of a neural network, novel encodings embrace a variety of approaches from unsupervised pretraining of latent representations to vectors of zero-cost proxies. In this paper, we categorize and investigate neural encodings from three main types: structural, learned, and score-based. Furthermore, we extend these encodings and introduce \textit{unified encodings}, that extend NAS predictors to multiple search spaces. Our analysis draws from experiments conducted on over 1.5 million neural network architectures on NAS spaces such as NASBench-101 (NB101), NB201, NB301, Network Design Spaces (NDS), and TransNASBench-101. Building on our study, we present our predictor \textbf{FLAN}: \textbf{Fl}ow \textbf{A}ttention for \textbf{N}AS. FLAN integrates critical insights on predictor design, transfer learning, and \textit{unified encodings} to enable more than an order of magnitude cost reduction for training NAS accuracy predictors. Our implementation and encodings for all neural networks are open-sourced at \href{ this https URL }{ this https URL \_nas}.
- [661] arXiv:2403.02495 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Pseudo-Labeling and Contextual Curriculum Learning for Online Grasp Learning in Robotic Bin PickingComments: Accepted to ICRA 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The prevailing grasp prediction methods predominantly rely on offline learning, overlooking the dynamic grasp learning that occurs during real-time adaptation to novel picking scenarios. These scenarios may involve previously unseen objects, variations in camera perspectives, and bin configurations, among other factors. In this paper, we introduce a novel approach, SSL-ConvSAC, that combines semi-supervised learning and reinforcement learning for online grasp learning. By treating pixels with reward feedback as labeled data and others as unlabeled, it efficiently exploits unlabeled data to enhance learning. In addition, we address the imbalance between labeled and unlabeled data by proposing a contextual curriculum-based method. We ablate the proposed approach on real-world evaluation data and demonstrate promise for improving online grasp learning on bin picking tasks using a physical 7-DoF Franka Emika robot arm with a suction gripper. Video: this https URL
- [662] arXiv:2403.02502 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Trial and Error: Exploration-Based Trajectory Optimization for LLM AgentsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn from their exploration failures. This leads to improved performance through an iterative optimization framework. During the exploration phase, the agent interacts with the environment while completing given tasks, gathering failure trajectories to create contrastive trajectory pairs. In the subsequent training phase, the agent utilizes these trajectory preference pairs to update its policy using contrastive learning methods like DPO. This iterative cycle of exploration and training fosters continued improvement in the agents. Our experiments on three complex tasks demonstrate that ETO consistently surpasses baseline performance by a large margin. Furthermore, an examination of task-solving efficiency and potential in scenarios lacking expert trajectory underscores the effectiveness of our approach.
- [663] arXiv:2403.02504 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Tutorial on the Pretrain-Finetune Paradigm for Natural Language ProcessingComments: 16 pages, 6 figures, 2 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The pretrain-finetune paradigm represents a transformative approach in natural language processing (NLP). This paradigm distinguishes itself through the use of large pretrained language models, demonstrating remarkable efficiency in finetuning tasks, even with limited training data. This efficiency is especially beneficial for research in social sciences, where the number of annotated samples is often quite limited. Our tutorial offers a comprehensive introduction to the pretrain-finetune paradigm. We first delve into the fundamental concepts of pretraining and finetuning, followed by practical exercises using real-world applications. We demonstrate the application of the paradigm across various tasks, including multi-class classification and regression. Emphasizing its efficacy and user-friendliness, the tutorial aims to encourage broader adoption of this paradigm. To this end, we have provided open access to all our code and datasets. The tutorial is particularly valuable for quantitative researchers in psychology, offering them an insightful guide into this innovative approach.
- [664] arXiv:2403.02509 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SPUQ: Perturbation-Based Uncertainty Quantification for Large Language ModelsComments: Accepted to appear at EACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, large language models (LLMs) have become increasingly prevalent, offering remarkable text generation capabilities. However, a pressing challenge is their tendency to make confidently wrong predictions, highlighting the critical need for uncertainty quantification (UQ) in LLMs. While previous works have mainly focused on addressing aleatoric uncertainty, the full spectrum of uncertainties, including epistemic, remains inadequately explored. Motivated by this gap, we introduce a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties. The method entails generating a set of perturbations for LLM inputs, sampling outputs for each perturbation, and incorporating an aggregation module that generalizes the sampling uncertainty approach for text generation tasks. Through extensive experiments on various datasets, we investigated different perturbation and aggregation techniques. Our findings show a substantial improvement in model uncertainty calibration, with a reduction in Expected Calibration Error (ECE) by 50\% on average. Our findings suggest that our proposed UQ method offers promising steps toward enhancing the reliability and trustworthiness of LLMs.
- [665] arXiv:2403.02514 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Purpose for Open-Ended Learning Robots: A Computational Taxonomy, Definition, and OperationalisationGianluca Baldassarre , Richard J. Duro , Emilio Cartoni , Mehdi Khamassi , Alejandro Romero , Vieri Giuliano SantucciComments: 15 pages, 6 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Autonomous open-ended learning (OEL) robots are able to cumulatively acquire new skills and knowledge through direct interaction with the environment, for example relying on the guidance of intrinsic motivations and self-generated goals. OEL robots have a high relevance for applications as they can use the autonomously acquired knowledge to accomplish tasks relevant for their human users. OEL robots, however, encounter an important limitation: this may lead to the acquisition of knowledge that is not so much relevant to accomplish the users' tasks. This work analyses a possible solution to this problem that pivots on the novel concept of `purpose'. Purposes indicate what the designers and/or users want from the robot. The robot should use internal representations of purposes, called here `desires', to focus its open-ended exploration towards the acquisition of knowledge relevant to accomplish them. This work contributes to develop a computational framework on purpose in two ways. First, it formalises a framework on purpose based on a three-level motivational hierarchy involving: (a) the purposes; (b) the desires, which are domain independent; (c) specific domain dependent state-goals. Second, the work highlights key challenges highlighted by the framework such as: the `purpose-desire alignment problem', the `purpose-goal grounding problem', and the `arbitration between desires'. Overall, the approach enables OEL robots to learn in an autonomous way but also to focus on acquiring goals and skills that meet the purposes of the designers and users.
- [666] arXiv:2403.02522 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HeAR -- Health Acoustic RepresentationsSebastien Baur , Zaid Nabulsi , Wei-Hung Weng , Jake Garrison , Louis Blankemeier , Sam Fishman , Christina Chen , Sujay Kakarmath , Minyoi Maimbolwa , Nsala Sanjase , Brian Shuma , Yossi Matias , Greg S. Corrado , Shwetak Patel , Shravya Shetty , Shruthi Prabhakara , Monde Muyoyeta , Diego ArdilaComments: 4 tables, 4 figures, 6 supplementary tables, 3 supplementary figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Health acoustic sounds such as coughs and breaths are known to contain useful health signals with significant potential for monitoring health and disease, yet are underexplored in the medical machine learning community. The existing deep learning systems for health acoustics are often narrowly trained and evaluated on a single task, which is limited by data and may hinder generalization to other tasks. To mitigate these gaps, we develop HeAR, a scalable self-supervised learning-based deep learning system using masked autoencoders trained on a large dataset of 313 million two-second long audio clips. Through linear probes, we establish HeAR as a state-of-the-art health audio embedding model on a benchmark of 33 health acoustic tasks across 6 datasets. By introducing this work, we hope to enable and accelerate further health acoustics research.
- [667] arXiv:2403.02528 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DACO: Towards Application-Driven and Comprehensive Data Analysis via Code GenerationXueqing Wu , Rui Zheng , Jingzhen Sha , Te-Lin Wu , Hanyu Zhou , Mohan Tang , Kai-Wei Chang , Nanyun Peng , Haoran HuangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data. In this work, we aim to propose new resources and benchmarks to inspire future research on this crucial yet challenging and under-explored task. However, collecting data analysis annotations curated by experts can be prohibitively expensive. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs with a multi-turn prompting technique. We construct the DACO dataset, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) ~2k query-answer pairs that can serve as weak supervision for model training, and (3) a concentrated but high-quality test set with human refined annotations that serves as our main evaluation benchmark. We train a 6B supervised fine-tuning (SFT) model on DACO dataset, and find that the SFT model learns reasonable data analysis capabilities. To further align the models with human preference, we use reinforcement learning to encourage generating analysis perceived by human as helpful, and design a set of dense rewards to propagate the sparse human preference reward to intermediate code generation steps. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases, validating the effectiveness of our proposed algorithm. Data and code are released at this https URL
- [668] arXiv:2403.02545 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Wukong: Towards a Scaling Law for Large-Scale RecommendationBuyun Zhang , Liang Luo , Yuxin Chen , Jade Nie , Xi Liu , Daifeng Guo , Yanli Zhao , Shen Li , Yuchen Hao , Yantao Yao , Guna Lakshminarayanan , Ellie Dingqiao Wen , Jongsoo Park , Maxim Naumov , Wenlin ChenComments: 12 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Scaling laws play an instrumental role in the sustainable improvement in model quality. Unfortunately, recommendation models to date do not exhibit such laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly more complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse, any-order of interactions simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models quality-wise. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models, while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 Gflop or equivalently up to Large Language Model (GPT-3) training compute scale, where prior arts fall short.
- [669] arXiv:2403.02567 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Eliciting Better Multilingual Structured Reasoning from LLMs through CodeSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Development of large language models (LLM) have shown progress on reasoning, though studies have been limited to English or simple reasoning tasks. We thus introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We then propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners. First, at training time, we augment a code dataset with multi-lingual comments using machine translation while keeping program code as-is. Second, at inference time, we bridge the gap between training and inference by employing a prompt structure that incorporates step-by-step code primitives to derive new facts and find a solution. Our methods show improved multilingual performance on xSTREET, most notably on the scientific commonsense reasoning subtask. Furthermore, the models show no regression on non-reasoning tasks, thus showing our techniques maintain general-purpose abilities.
- [670] arXiv:2403.02574 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature SummaryComments: 18 pages, 5 figuresSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The literature review is an indispensable step in the research process. It provides the benefit of comprehending the research problem and understanding the current research situation while conducting a comparative analysis of prior works. However, literature summary is challenging and time consuming. The previous LLM-based studies on literature review mainly focused on the complete process, including literature retrieval, screening, and summarization. However, for the summarization step, simple CoT method often lacks the ability to provide extensive comparative summary. In this work, we firstly focus on the independent literature summarization step and introduce ChatCite, an LLM agent with human workflow guidance for comparative literature summary. This agent, by mimicking the human workflow, first extracts key elements from relevant literature and then generates summaries using a Reflective Incremental Mechanism. In order to better evaluate the quality of the generated summaries, we devised a LLM-based automatic evaluation metric, G-Score, in refer to the human evaluation criteria. The ChatCite agent outperformed other models in various dimensions in the experiments. The literature summaries generated by ChatCite can also be directly used for drafting literature reviews.
- [671] arXiv:2403.02589 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: MUSIC: Accelerated Convergence for Distributed Optimization With Inexact and Exact MethodsSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI)
Abstract: Gradient-type distributed optimization methods have blossomed into one of the most important tools for solving a minimization learning task over a networked agent system. However, only one gradient update per iteration is difficult to achieve a substantive acceleration of convergence. In this paper, we propose an accelerated framework named as MUSIC allowing each agent to perform multiple local updates and a single combination in each iteration. More importantly, we equip inexact and exact distributed optimization methods into this framework, thereby developing two new algorithms that exhibit accelerated linear convergence and high communication efficiency. Our rigorous convergence analysis reveals the sources of steady-state errors arising from inexact policies and offers effective solutions. Numerical results based on synthetic and real datasets demonstrate both our theoretical motivations and analysis, as well as performance advantages.
- [672] arXiv:2403.02607 (cross-list from cs.GT) [ pdf , ps , other ]
-
Title: MEBS: Multi-task End-to-end Bid Shading for Multi-slot Display AdvertisingZhen Gong , Lvyin Niu , Yang Zhao , Miao Xu , Zhenzhe Zheng , Haoqi Zhang , Zhilin Zhang , Fan Wu , Rongquan Bai , Chuan Yu , Jian Xu , Bo ZhengSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: Online bidding and auction are crucial aspects of the online advertising industry. Conventionally, there is only one slot for ad display and most current studies focus on it. Nowadays, multi-slot display advertising is gradually becoming popular where many ads could be displayed in a list and shown as a whole to users. However, multi-slot display advertising leads to different cost-effectiveness. Advertisers have the incentive to adjust bid prices so as to win the most economical ad positions. In this study, we introduce bid shading into multi-slot display advertising for bid price adjustment with a Multi-task End-to-end Bid Shading(MEBS) method. We prove the optimality of our method theoretically and examine its performance experimentally. Through extensive offline and online experiments, we demonstrate the effectiveness and efficiency of our method, and we obtain a 7.01% lift in Gross Merchandise Volume, a 7.42% lift in Return on Investment, and a 3.26% lift in ad buy count.
- [673] arXiv:2403.02611 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive LearningComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Defocus blur is a persistent problem in microscope imaging that poses harm to pathology interpretation and medical intervention in cell microscopy and microscope surgery. To address this problem, a unified framework including the multi-pyramid transformer (MPT) and extended frequency contrastive regularization (EFCR) is proposed to tackle two outstanding challenges in microscopy deblur: longer attention span and data deficiency. The MPT employs an explicit pyramid structure at each network stage that integrates the cross-scale window attention (CSWA), the intra-scale channel attention (ISCA), and the feature-enhancing feed-forward network (FEFN) to capture long-range cross-scale spatial interaction and global channel context. The EFCR addresses the data deficiency problem by exploring latent deblur signals from different frequency bands. It also enables deblur knowledge transfer to learn cross-domain information from extra data, improving deblur performance for labeled and unlabeled data. Extensive experiments and downstream task validation show the framework achieves state-of-the-art performance across multiple datasets. Project page: this https URL .
- [674] arXiv:2403.02613 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Large Language Models and Video Games: A Preliminary Scoping ReviewComments: under reviewSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) hold interesting potential for the design, development, and research of video games. Building on the decades of prior research on generative AI in games, many researchers have sped to investigate the power and potential of LLMs for games. Given the recent spike in LLM-related research in games, there is already a wealth of relevant research to survey. In order to capture a snapshot of the state of LLM research in games, and to help lay the foundation for future work, we carried out an initial scoping review of relevant papers published so far. In this paper, we review 76 papers published between 2022 to early 2024 on LLMs and video games, with key focus areas in game AI, game development, narrative, and game research and reviews. Our paper provides an early state of the field and lays the groundwork for future research and reviews on this topic.
- [675] arXiv:2403.02616 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Unsupervised Spatio-Temporal State Estimation for Fine-grained Adaptive Anomaly Diagnosis of Industrial Cyber-physical SystemsComments: 23 pages, 7 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Abstract: Accurate detection and diagnosis of abnormal behaviors such as network attacks from multivariate time series (MTS) are crucial for ensuring the stable and effective operation of industrial cyber-physical systems (CPS). However, existing researches pay little attention to the logical dependencies among system working states, and have difficulties in explaining the evolution mechanisms of abnormal signals. To reveal the spatio-temporal association relationships and evolution mechanisms of the working states of industrial CPS, this paper proposes a fine-grained adaptive anomaly diagnosis method (i.e. MAD-Transformer) to identify and diagnose anomalies in MTS. MAD-Transformer first constructs a temporal state matrix to characterize and estimate the change patterns of the system states in the temporal dimension. Then, to better locate the anomalies, a spatial state matrix is also constructed to capture the inter-sensor state correlation relationships within the system. Subsequently, based on these two types of state matrices, a three-branch structure of series-temporal-spatial attention module is designed to simultaneously capture the series, temporal, and space dependencies among MTS. Afterwards, three associated alignment loss functions and a reconstruction loss are constructed to jointly optimize the model. Finally, anomalies are determined and diagnosed by comparing the residual matrices with the original matrices. We conducted comparative experiments on five publicly datasets spanning three application domains (service monitoring, spatial and earth exploration, and water treatment), along with a petroleum refining simulation dataset collected by ourselves. The results demonstrate that MAD-Transformer can adaptively detect fine-grained anomalies with short duration, and outperforms the state-of-the-art baselines in terms of noise robustness and localization performance.
- [676] arXiv:2403.02622 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: World Models for Autonomous Driving: An Initial SurveyYanchen Guan , Haicheng Liao , Zhenning Li , Jia Hu , Runze Yuan , Yunjian Li , Guohui Zhang , Chengzhong XuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: In the rapidly evolving landscape of autonomous driving, the capability to accurately predict future events and assess their implications is paramount for both safety and efficiency, critically aiding the decision-making process. World models have emerged as a transformative approach, enabling autonomous driving systems to synthesize and interpret vast amounts of sensor data, thereby predicting potential future scenarios and compensating for information gaps. This paper provides an initial review of the current state and prospective advancements of world models in autonomous driving, spanning their theoretical underpinnings, practical applications, and the ongoing research efforts aimed at overcoming existing limitations. Highlighting the significant role of world models in advancing autonomous driving technologies, this survey aspires to serve as a foundational reference for the research community, facilitating swift access to and comprehension of this burgeoning field, and inspiring continued innovation and exploration.
- [677] arXiv:2403.02624 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Pareto-Optimal Estimation and Policy Learning on Short-term and Long-term Treatment EffectsYingrong Wang , Anpeng Wu , Haoxuan Li , Weiming Liu , Qiaowei Miao , Ruoxuan Xiong , Fei Wu , Kun KuangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper focuses on developing Pareto-optimal estimation and policy learning to identify the most effective treatment that maximizes the total reward from both short-term and long-term effects, which might conflict with each other. For example, a higher dosage of medication might increase the speed of a patient's recovery (short-term) but could also result in severe long-term side effects. Although recent works have investigated the problems about short-term or long-term effects or the both, how to trade-off between them to achieve optimal treatment remains an open challenge. Moreover, when multiple objectives are directly estimated using conventional causal representation learning, the optimization directions among various tasks can conflict as well. In this paper, we systematically investigate these issues and introduce a Pareto-Efficient algorithm, comprising Pareto-Optimal Estimation (POE) and Pareto-Optimal Policy Learning (POPL), to tackle them. POE incorporates a continuous Pareto module with representation balancing, enhancing estimation efficiency across multiple tasks. As for POPL, it involves deriving short-term and long-term outcomes linked with various treatment levels, facilitating an exploration of the Pareto frontier emanating from these outcomes. Results on both the synthetic and real-world datasets demonstrate the superiority of our method.
- [678] arXiv:2403.02647 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FinReport: Explainable Stock Earnings Forecasting via News Factor Analyzing ModelComments: Accepted by WWW 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The task of stock earnings forecasting has received considerable attention due to the demand investors in real-world scenarios. However, compared with financial institutions, it is not easy for ordinary investors to mine factors and analyze news. On the other hand, although large language models in the financial field can serve users in the form of dialogue robots, it still requires users to have financial knowledge to ask reasonable questions. To serve the user experience, we aim to build an automatic system, FinReport, for ordinary investors to collect information, analyze it, and generate reports after summarizing.
Specifically, our FinReport is based on financial news announcements and a multi-factor model to ensure the professionalism of the report. The FinReport consists of three modules: news factorization module, return forecasting module, risk assessment module. The news factorization module involves understanding news information and combining it with stock factors, the return forecasting module aim to analysis the impact of news on market sentiment, and the risk assessment module is adopted to control investment risk. Extensive experiments on real-world datasets have well verified the effectiveness and explainability of our proposed FinReport. Our codes and datasets are available at this https URL . - [679] arXiv:2403.02648 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGradSayantan Choudhury , Nazarii Tupitsa , Nicolas Loizou , Samuel Horvath , Martin Takac , Eduard GorbunovComments: 26 pages, 9 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Adaptive methods are extremely popular in machine learning as they make learning rate tuning less expensive. This paper introduces a novel optimization algorithm named KATE, which presents a scale-invariant adaptation of the well-known AdaGrad algorithm. We prove the scale-invariance of KATE for the case of Generalized Linear Models. Moreover, for general smooth non-convex problems, we establish a convergence rate of $O \left(\frac{\log T}{\sqrt{T}} \right)$ for KATE, matching the best-known ones for AdaGrad and Adam. We also compare KATE to other state-of-the-art adaptive algorithms Adam and AdaGrad in numerical experiments with different problems, including complex machine learning tasks like image classification and text classification on real data. The results indicate that KATE consistently outperforms AdaGrad and matches/surpasses the performance of Adam in all considered scenarios.
- [680] arXiv:2403.02651 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Learning at the Speed of Wireless: Online Real-Time Learning for AI-Enabled MIMO in NextGComments: 7 pages, 4 figures, 1 table, magazine paperSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI)
Abstract: Integration of artificial intelligence (AI) and machine learning (ML) into the air interface has been envisioned as a key technology for next-generation (NextG) cellular networks. At the air interface, multiple-input multiple-output (MIMO) and its variants such as multi-user MIMO (MU-MIMO) and massive/full-dimension MIMO have been key enablers across successive generations of cellular networks with evolving complexity and design challenges. Initiating active investigation into leveraging AI/ML tools to address these challenges for MIMO becomes a critical step towards an AI-enabled NextG air interface. At the NextG air interface, the underlying wireless environment will be extremely dynamic with operation adaptations performed on a sub-millisecond basis by MIMO operations such as MU-MIMO scheduling and rank/link adaptation. Given the enormously large number of operation adaptation possibilities, we contend that online real-time AI/ML-based approaches constitute a promising paradigm. To this end, we outline the inherent challenges and offer insights into the design of such online real-time AI/ML-based solutions for MIMO operations. An online real-time AI/ML-based method for MIMO-OFDM channel estimation is then presented, serving as a potential roadmap for developing similar techniques across various MIMO operations in NextG.
- [681] arXiv:2403.02687 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: Enhanced DareFightingICE Competitions: Sound Design and AI CompetitionsIbrahim Khan , Chollakorn Nimpattanavong , Thai Van Nguyen , Kantinan Plupattanakit , Ruck ThawonmasComments: This paper describes a new competition platform using Unity for our competitions at the 2024 IEEE Conference on Games (CoG 2024). It was accepted for presentation at CoG 2024. However, we recently discovered a much more effective way to do this task without using Unity, leading to our decision to withdraw the paper from CoG 2024 and ArXivSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: This paper presents a new and improved DareFightingICE platform, a fighting game platform with a focus on visually impaired players (VIPs), in the Unity game engine. It also introduces the separation of the DareFightingICE Competition into two standalone competitions called DareFightingICE Sound Design Competition and DareFightingICE AI Competition--at the 2024 IEEE Conference on Games (CoG)--in which a new platform will be used. This new platform is an enhanced version of the old DareFightingICE platform, having a better audio system to convey 3D sound and a better way to send audio data to AI agents. With this enhancement and by utilizing Unity, the new DareFightingICE platform is more accessible in terms of adding new features for VIPs and future audio research. This paper also improves the evaluation method for evaluating sound designs in the Sound Design Competition which will ensure a better sound design for VIPs as this competition continues to run at future CoG. To the best of our knowledge, both of our competitions are first of their kind, and the connection between the competitions to mutually improve the entries' quality with time makes these competitions an important part of representing an often overlooked segment within the broader gaming community, VIPs.
- [682] arXiv:2403.02688 (cross-list from cs.ET) [ pdf , ps , html , other ]
-
Title: DOCTOR: Dynamic On-Chip Remediation Against Temporally-Drifting Thermal Variations Toward Self-Corrected Photonic Tensor AcceleratorsComments: 8 pagesSubjects: Emerging Technologies (cs.ET) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads, offering unparalleled speed and energy efficiency, especially in resource-limited, latency-sensitive edge computing environments. However, the deployment of analog photonic tensor accelerators encounters reliability challenges due to hardware noises and environmental variations. While off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators with moderate, static noises, we observe a notable performance degradation over time due to temporally drifting variations, which requires a real-time, in-situ calibration mechanism. To tackle this challenging reliability issues, for the first time, we propose a lightweight dynamic on-chip remediation framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noises. The DOCTOR framework intelligently monitors the chip status using adaptive probing and performs fast in-situ training-free calibration to restore accuracy when necessary. Recognizing nonuniform spatial variation distributions across devices and tensor cores, we also propose a variation-aware architectural remapping strategy to avoid executing critical tasks on noisy devices. Extensive experiments show that our proposed framework can guarantee sustained performance under drifting variations with 34% higher accuracy and 2-3 orders-of-magnitude lower overhead compared to state-of-the-art on-chip training methods.
- [683] arXiv:2403.02694 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Privacy-Aware Semantic Cache for Large Language ModelsWaris Gill (1), Mohamed Elidrisi (2), Pallavi Kalapatapu (2), Ali Anwar (3), Muhammad Ali Gulzar (1) ((1) Virginia Tech, USA, (2) Cisco, USA (3) University of Minnesota, Minneapolis, USA)Comments: This study presents the first privacy aware semantic cache for LLMs based on Federated Learning. Total pages 13Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Large Language Models (LLMs) like ChatGPT and Llama2 have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters where inference demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries which constitute about 31% of the total queries. However, existing caching methods are incapable of finding semantic similarities among LLM queries, leading to unacceptable false hit-and-miss rates.
This paper introduces MeanCache, a user-centric semantic cache for LLMs that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user's semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. Existing caching solutions for LLMs raise privacy and scalability concerns and perform wasteful query requests. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model across LLM users without violating privacy. By placing a local cache in each user's device and using FL, MeanCache reduces the latency and costs and enhances model performance, resulting in lower false hit rates. MeanCache compresses the embedding dimensions to minimize cache storage and also finds the optimal cosine similarity threshold. Our experiments benchmarked against the state-of-the-art caching method, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision during semantic cache hit-and-miss decisions. It also reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%. - [684] arXiv:2403.02701 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Fighting Game Adaptive Background Music for Improved GameplayComments: This is an updated version of our IEEE CoG 2023 paper ( this https URL ). This version has revised the description of the association between the distance between the two players (PD) and the instrument's volume on page 2. arXiv admin note: substantial text overlap with arXiv:2303.15734Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: This paper presents our work to enhance the background music (BGM) in DareFightingICE by adding adaptive features. The adaptive BGM consists of three different categories of instruments playing the BGM of the winner sound design from the 2022 DareFightingICE Competition. The BGM adapts by changing the volume of each category of instruments. Each category is connected to a different element of the game. We then run experiments to evaluate the adaptive BGM by using a deep reinforcement learning AI agent that only uses audio as input (Blind DL AI). The results show that the performance of the Blind DL AI improves while playing with the adaptive BGM as compared to playing without the adaptive BGM.
- [685] arXiv:2403.02715 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language ModelsComments: 33 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-sourced LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we have finetuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 common tasks and 31 metrics. Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese. Moreover, our analysis indicates that models with more parameters can introduce more biases and uncalibrated outputs and the key factor influencing LLM performance is the quality of the training or fine-tuning datasets. These insights underscore the significance of meticulous fine-tuning with high-quality datasets in enhancing LLM performance.
- [686] arXiv:2403.02726 (cross-list from econ.GN) [ pdf , ps , other ]
-
Title: Bias in Generative AISubjects: General Economics (econ.GN) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: This study analyzed images generated by three popular generative artificial intelligence (AI) tools - Midjourney, Stable Diffusion, and DALLE 2 - representing various occupations to investigate potential bias in AI generators. Our analysis revealed two overarching areas of concern in these AI generators, including (1) systematic gender and racial biases, and (2) subtle biases in facial expressions and appearances. Firstly, we found that all three AI generators exhibited bias against women and African Americans. Moreover, we found that the evident gender and racial biases uncovered in our analysis were even more pronounced than the status quo when compared to labor force statistics or Google images, intensifying the harmful biases we are actively striving to rectify in our society. Secondly, our study uncovered more nuanced prejudices in the portrayal of emotions and appearances. For example, women were depicted as younger with more smiles and happiness, while men were depicted as older with more neutral expressions and anger, posing a risk that generative AI models may unintentionally depict women as more submissive and less competent than men. Such nuanced biases, by their less overt nature, might be more problematic as they can permeate perceptions unconsciously and may be more difficult to rectify. Although the extent of bias varied depending on the model, the direction of bias remained consistent in both commercial and open-source AI generators. As these tools become commonplace, our study highlights the urgency to identify and mitigate various biases in generative AI, reinforcing the commitment to ensuring that AI technologies benefit all of humanity for a more inclusive future.
- [687] arXiv:2403.02727 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: HARGPT: Are LLMs Zero-Shot Human Activity Recognizers?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: There is an ongoing debate regarding the potential of Large Language Models (LLMs) as foundational models seamlessly integrated with Cyber-Physical Systems (CPS) for interpreting the physical world. In this paper, we carry out a case study to answer the following question: Are LLMs capable of zero-shot human activity recognition (HAR). Our study, HARGPT, presents an affirmative answer by demonstrating that LLMs can comprehend raw IMU data and perform HAR tasks in a zero-shot manner, with only appropriate prompts. HARGPT inputs raw IMU data into LLMs and utilizes the role-play and think step-by-step strategies for prompting. We benchmark HARGPT on GPT4 using two public datasets of different inter-class similarities and compare various baselines both based on traditional machine learning and state-of-the-art deep classification models. Remarkably, LLMs successfully recognize human activities from raw IMU data and consistently outperform all the baselines on both datasets. Our findings indicate that by effective prompting, LLMs can interpret raw IMU data based on their knowledge base, possessing a promising potential to analyze raw sensor data of the physical world effectively.
- [688] arXiv:2403.02736 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Bootstrapping Rare Object Detection in High-Resolution Satellite ImageryAkram Zaytar , Caleb Robinson , Gilles Q. Hacheme , Girmaw A. Tadesse , Rahul Dodhia , Juan M. Lavista Ferres , Lacey F. Hughey , Jared A. Stabach , Irene AmokeSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Rare object detection is a fundamental task in applied geospatial machine learning, however is often challenging due to large amounts of high-resolution satellite or aerial imagery and few or no labeled positive samples to start with. This paper addresses the problem of bootstrapping such a rare object detection task assuming there is no labeled data and no spatial prior over the area of interest. We propose novel offline and online cluster-based approaches for sampling patches that are significantly more efficient, in terms of exposing positive samples to a human annotator, than random sampling. We apply our methods for identifying bomas, or small enclosures for herd animals, in the Serengeti Mara region of Kenya and Tanzania. We demonstrate a significant enhancement in detection efficiency, achieving a positive sampling rate increase from 2% (random) to 30%. This advancement enables effective machine learning mapping even with minimal labeling budgets, exemplified by an F1 score on the boma detection task of 0.51 with a budget of 300 total patches.
- [689] arXiv:2403.02750 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Speckle Noise Reduction in Ultrasound Images using Denoising Auto-encoder with Skip ConnectionComments: Selected for presentation at 2024 IEEE South Asian Ultrasonics SymposiumSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Abstract: Ultrasound is a widely used medical tool for non-invasive diagnosis, but its images often contain speckle noise which can lower their resolution and contrast-to-noise ratio. This can make it more difficult to extract, recognize, and analyze features in the images, as well as impair the accuracy of computer-assisted diagnostic techniques and the ability of doctors to interpret the images. Reducing speckle noise, therefore, is a crucial step in the preprocessing of ultrasound images. Researchers have proposed several speckle reduction methods, but no single method takes all relevant factors into account. In this paper, we compare seven such methods: Median, Gaussian, Bilateral, Average, Weiner, Anisotropic and Denoising auto-encoder without and with skip connections in terms of their ability to preserve features and edges while effectively reducing noise. In an experimental study, a convolutional noise-removing auto-encoder with skip connection, a deep learning method, was used to improve ultrasound images of breast cancer. This method involved adding speckle noise at various levels. The results of the deep learning method were compared to those of traditional image enhancement methods, and it was found that the proposed method was more effective. To assess the performance of these algorithms, we use three established evaluation metrics and present both filtered images and statistical data.
- [690] arXiv:2403.02772 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Rehabilitation Exercise Quality Assessment through Supervised Contrastive Learning with Hard and Soft NegativesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Abstract: Exercise-based rehabilitation programs have proven to be effective in enhancing the quality of life and reducing mortality and rehospitalization rates. AI-driven virtual rehabilitation, which allows patients to independently complete exercises at home, utilizes AI algorithms to analyze exercise data, providing feedback to patients and updating clinicians on their progress. These programs commonly prescribe a variety of exercise types, leading to a distinct challenge in rehabilitation exercise assessment datasets: while abundant in overall training samples, these datasets often have a limited number of samples for each individual exercise type. This disparity hampers the ability of existing approaches to train generalizable models with such a small sample size per exercise. Addressing this issue, our paper introduces a novel supervised contrastive learning framework with hard and soft negative samples that effectively utilizes the entire dataset to train a single model applicable to all exercise types. This model, with a Spatial-Temporal Graph Convolutional Network (ST-GCN) architecture, demonstrated enhanced generalizability across exercises and a decrease in overall complexity. Through extensive experiments on three publicly available rehabilitation exercise assessment datasets, the University of Idaho-Physical Rehabilitation Movement Data (UI-PRMD), IntelliRehabDS (IRDS), and KInematic assessment of MOvement and clinical scores for remote monitoring of physical REhabilitation (KIMORE), our method has shown to surpass existing methods, setting a new benchmark in rehabilitation exercise assessment accuracy.
- [691] arXiv:2403.02786 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Semi-Supervised Graph Representation Learning with Human-centric Explanation for Predicting Fatty Liver DiseaseComments: Paper accepted in Human-Centric Representation Learning workshop at AAAI 2024 ( this https URL )Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Addressing the challenge of limited labeled data in clinical settings, particularly in the prediction of fatty liver disease, this study explores the potential of graph representation learning within a semi-supervised learning framework. Leveraging graph neural networks (GNNs), our approach constructs a subject similarity graph to identify risk patterns from health checkup data. The effectiveness of various GNN approaches in this context is demonstrated, even with minimal labeled samples. Central to our methodology is the inclusion of human-centric explanations through explainable GNNs, providing personalized feature importance scores for enhanced interpretability and clinical relevance, thereby underscoring the potential of our approach in advancing healthcare practices with a keen focus on graph representation learning and human-centric explanation.
- [692] arXiv:2403.02794 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: A Distance Metric Learning Model Based On Variational Information BottleneckSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In recent years, personalized recommendation technology has flourished and become one of the hot research directions. The matrix factorization model and the metric learning model which proposed successively have been widely studied and applied. The latter uses the Euclidean distance instead of the dot product used by the former to measure the latent space vector. While avoiding the shortcomings of the dot product, the assumption of Euclidean distance is neglected, resulting in limited recommendation quality of the model. In order to solve this problem, this paper combines the Variationl Information Bottleneck with metric learning model for the first time, and proposes a new metric learning model VIB-DML (Variational Information Bottleneck Distance Metric Learning) for rating prediction, which limits the mutual information of the latent space feature vector to improve the robustness of the model and satisfiy the assumption of Euclidean distance by decoupling the latent space feature vector. In this paper, the experimental results are compared with the root mean square error (RMSE) on the three public datasets. The results show that the generalization ability of VIB-DML is excellent. Compared with the general metric learning model MetricF, the prediction error is reduced by 7.29%. Finally, the paper proves the strong robustness of VIBDML through experiments.
- [693] arXiv:2403.02799 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DPPA: Pruning Method for Large Language Model to Model MergingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Model merging is to combine fine-tuned models derived from multiple domains, with the intent of enhancing the model's proficiency across various domains. The principal concern is the resolution of parameter conflicts. A substantial amount of existing research remedy this issue during the merging stage, with the latest study focusing on resolving this issue throughout the pruning stage. The DARE approach has exhibited promising outcomes when applied to a simplistic fine-tuned model. However, the efficacy of this method tends to wane when employed on complex fine-tuned models that show a significant parameter bias relative to the baseline model. In this paper, we introduce a dual-stage method termed Dynamic Pruning Partition Amplification (DPPA), devised to tackle the challenge of merging complex fine-tuned models. Initially, we introduce Dynamically Pruning (DP), an improved approach based on magnitude pruning, which aim is to enhance performance at higher pruning rates. Subsequently, we propose Dynamically Partition Amplification (DPA), a rescaling strategy, is designed to dynamically amplify parameter partitions in relation to their significance levels. The experimental results show that our method maintains a mere 20% of domain-specific parameters and yet delivers a performance comparable to other methodologies that preserve up to 90% of parameters. Furthermore, our method displays outstanding performance post-pruning, leading to a significant improvement of nearly 20% performance in model merging. We make our code on Github.
- [694] arXiv:2403.02810 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Dynamic Gaussian Graph Operator: Learning parametric partial differential equations in arbitrary discrete mechanics problemsComments: The number of figures is 13. The number of tables is 7. The number of words is 9854Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep learning methods have access to be employed for solving physical systems governed by parametric partial differential equations (PDEs) due to massive scientific data. It has been refined to operator learning that focuses on learning non-linear mapping between infinite-dimensional function spaces, offering interface from observations to solutions. However, state-of-the-art neural operators are limited to constant and uniform discretization, thereby leading to deficiency in generalization on arbitrary discretization schemes for computational domain. In this work, we propose a novel operator learning algorithm, referred to as Dynamic Gaussian Graph Operator (DGGO) that expands neural operators to learning parametric PDEs in arbitrary discrete mechanics problems. The Dynamic Gaussian Graph (DGG) kernel learns to map the observation vectors defined in general Euclidean space to metric vectors defined in high-dimensional uniform metric space. The DGG integral kernel is parameterized by Gaussian kernel weighted Riemann sum approximating and using dynamic message passing graph to depict the interrelation within the integral term. Fourier Neural Operator is selected to localize the metric vectors on spatial and frequency domains. Metric vectors are regarded as located on latent uniform domain, wherein spatial and spectral transformation offer highly regular constraints on solution space. The efficiency and robustness of DGGO are validated by applying it to solve numerical arbitrary discrete mechanics problems in comparison with mainstream neural operators. Ablation experiments are implemented to demonstrate the effectiveness of spatial transformation in the DGG kernel. The proposed method is utilized to forecast stress field of hyper-elastic material with geometrically variable void as engineering application.
- [695] arXiv:2403.02814 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: InjectTST: A Transformer Method of Injecting Global Information into Independent Channels for Long Time Series ForecastingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Transformer has become one of the most popular architectures for multivariate time series (MTS) forecasting. Recent Transformer-based MTS models generally prefer channel-independent structures with the observation that channel independence can alleviate noise and distribution drift issues, leading to more robustness. Nevertheless, it is essential to note that channel dependency remains an inherent characteristic of MTS, carrying valuable information. Designing a model that incorporates merits of both channel-independent and channel-mixing structures is a key to further improvement of MTS forecasting, which poses a challenging conundrum. To address the problem, an injection method for global information into channel-independent Transformer, InjectTST, is proposed in this paper. Instead of designing a channel-mixing model directly, we retain the channel-independent backbone and gradually inject global information into individual channels in a selective way. A channel identifier, a global mixing module and a self-contextual attention module are devised in InjectTST. The channel identifier can help Transformer distinguish channels for better representation. The global mixing module produces cross-channel global information. Through the self-contextual attention module, the independent channels can selectively concentrate on useful global information without robustness degradation, and channel mixing is achieved implicitly. Experiments indicate that InjectTST can achieve stable improvement compared with state-of-the-art models.
- [696] arXiv:2403.02846 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FLGuard: Byzantine-Robust Federated Learning via Ensemble of Contrastive ModelsComments: Accepted by 28th European Symposium on Research in Computer Security (ESORICS 2023)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated Learning (FL) thrives in training a global model with numerous clients by only sharing the parameters of their local models trained with their private training datasets. Therefore, without revealing the private dataset, the clients can obtain a deep learning (DL) model with high performance. However, recent research proposed poisoning attacks that cause a catastrophic loss in the accuracy of the global model when adversaries, posed as benign clients, are present in a group of clients. Therefore, recent studies suggested byzantine-robust FL methods that allow the server to train an accurate global model even with the adversaries present in the system. However, many existing methods require the knowledge of the number of malicious clients or the auxiliary (clean) dataset or the effectiveness reportedly decreased hugely when the private dataset was non-independently and identically distributed (non-IID). In this work, we propose FLGuard, a novel byzantine-robust FL method that detects malicious clients and discards malicious local updates by utilizing the contrastive learning technique, which showed a tremendous improvement as a self-supervised learning method. With contrastive models, we design FLGuard as an ensemble scheme to maximize the defensive capability. We evaluate FLGuard extensively under various poisoning attacks and compare the accuracy of the global model with existing byzantine-robust FL methods. FLGuard outperforms the state-of-the-art defense methods in most cases and shows drastic improvement, especially in non-IID settings. this https URL
- [697] arXiv:2403.02877 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous DrivingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: End-to-end differentiable learning for autonomous driving (AD) has recently become a prominent paradigm. One main bottleneck lies in its voracious appetite for high-quality labeled data e.g. 3D bounding boxes and semantic segmentation, which are notoriously expensive to manually annotate. The difficulty is further pronounced due to the prominent fact that the behaviors within samples in AD often suffer from long tailed distribution. In other words, a large part of collected data can be trivial (e.g. simply driving forward in a straight road) and only a few cases are safety-critical. In this paper, we explore a practically important yet under-explored problem about how to achieve sample and label efficiency for end-to-end AD. Specifically, we design a planning-oriented active learning method which progressively annotates part of collected raw data according to the proposed diversity and usefulness criteria for planning routes. Empirically, we show that our planning-oriented approach could outperform general active learning methods by a large margin. Notably, our method achieves comparable performance with state-of-the-art end-to-end AD methods - by using only 30% nuScenes data. We hope our work could inspire future works to explore end-to-end AD from a data-centric perspective in addition to methodology efforts.
- [698] arXiv:2403.02884 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MathScale: Scaling Instruction Tuning for Mathematical ReasoningComments: Work in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., {\tt GPT-3.5}). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then build a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate mathematical reasoning abilities of LLMs comprehensively, we construct {\sc MwpBench}, a benchmark of Math Word Problems, which is a collection of ten datasets (including GSM8K and MATH) covering K-12, college, and competition level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on {\sc MwpBench}, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9\% in micro average accuracy and 43.7\% in macro average accuracy, respectively.
- [699] arXiv:2403.02892 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Long-Term Person Re-Identification Using Global, Local Body Part, and Head StreamsComments: 16 pagesJournal-ref: Neurocomputing, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This work addresses the task of long-term person re-identification. Typically, person re-identification assumes that people do not change their clothes, which limits its applications to short-term scenarios. To overcome this limitation, we investigate long-term person re-identification, which considers both clothes-changing and clothes-consistent scenarios. In this paper, we propose a novel framework that effectively learns and utilizes both global and local information. The proposed framework consists of three streams: global, local body part, and head streams. The global and head streams encode identity-relevant information from an entire image and a cropped image of the head region, respectively. Both streams encode the most distinct, less distinct, and average features using the combinations of adversarial erasing, max pooling, and average pooling. The local body part stream extracts identity-related information for each body part, allowing it to be compared with the same body part from another image. Since body part annotations are not available in re-identification datasets, pseudo-labels are generated using clustering. These labels are then utilized to train a body part segmentation head in the local body part stream. The proposed framework is trained by backpropagating the weighted summation of the identity classification loss, the pair-based loss, and the pseudo body part segmentation loss. To demonstrate the effectiveness of the proposed method, we conducted experiments on three publicly available datasets (Celeb-reID, PRCC, and VC-Clothes). The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art method.
- [700] arXiv:2403.02893 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Zero-Shot Cross-Lingual Document-Level Event Causality Identification with Heterogeneous Graph Contrastive Transfer LearningZhitao He , Pengfei Cao , Zhuoran Jin , Yubo Chen , Kang Liu , Zhiqiang Zhang , Mengshu Sun , Jun ZhaoComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Event Causality Identification (ECI) refers to the detection of causal relations between events in texts. However, most existing studies focus on sentence-level ECI with high-resource languages, leaving more challenging document-level ECI (DECI) with low-resource languages under-explored. In this paper, we propose a Heterogeneous Graph Interaction Model with Multi-granularity Contrastive Transfer Learning (GIMC) for zero-shot cross-lingual document-level ECI. Specifically, we introduce a heterogeneous graph interaction network to model the long-distance dependencies between events that are scattered over a document. Then, to improve cross-lingual transferability of causal knowledge learned from the source language, we propose a multi-granularity contrastive transfer learning module to align the causal representations across languages. Extensive experiments show our framework outperforms the previous state-of-the-art model by 9.4% and 8.2% of average F1 score on monolingual and multilingual scenarios respectively. Notably, in the multilingual scenario, our zero-shot framework even exceeds GPT-3.5 with few-shot learning by 24.3% in overall performance.
- [701] arXiv:2403.02910 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ImgTrojan: Jailbreaking Vision-Language Models with ONE ImageSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: There has been an increasing interest in the alignment of large language models (LLMs) with human values. However, the safety issues of their integration with a vision module, or vision language models (VLMs), remain relatively underexplored. In this paper, we propose a novel jailbreaking attack against VLMs, aiming to bypass their safety barrier when a user inputs harmful instructions. A scenario where our poisoned (image, text) data pairs are included in the training data is assumed. By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images. Moreover, we analyze the effect of poison ratios and positions of trainable parameters on our attack's success rate. For evaluation, we design two metrics to quantify the success rate and the stealthiness of our attack. Together with a list of curated harmful instructions, a benchmark for measuring attack efficacy is provided. We demonstrate the efficacy of our attack by comparing it with baseline methods.
- [702] arXiv:2403.02920 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-SoftmaxSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The quadratic complexity of the attention mechanism represents one of the biggest hurdles for processing long sequences using Transformers. Current methods, relying on sparse representations or stateful recurrence, sacrifice token-to-token interactions, which ultimately leads to compromises in performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points where employing TaylorShift becomes more efficient than traditional attention, aligning closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift enhances memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably with the vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, we provide access to our code under this https URL .
- [703] arXiv:2403.02939 (cross-list from cs.DL) [ pdf , ps , other ]
-
Title: PaperWeaver: Enriching Topical Paper Alerts by Contextualizing Recommended Papers with User-collected PapersYoonjoo Lee , Hyeonsu B. Kang , Matt Latzke , Juho Kim , Jonathan Bragg , Joseph Chee Chang , Pao SiangliulueComments: Accepted to CHI 2024Subjects: Digital Libraries (cs.DL) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Abstract: With the rapid growth of scholarly archives, researchers subscribe to "paper alert" systems that periodically provide them with recommendations of recently published papers that are similar to previously collected papers. However, researchers sometimes struggle to make sense of nuanced connections between recommended papers and their own research context, as existing systems only present paper titles and abstracts. To help researchers spot these connections, we present PaperWeaver, an enriched paper alerts system that provides contextualized text descriptions of recommended papers based on user-collected papers. PaperWeaver employs a computational method based on Large Language Models (LLMs) to infer users' research interests from their collected papers, extract context-specific aspects of papers, and compare recommended and collected papers on these aspects. Our user study (N=15) showed that participants using PaperWeaver were able to better understand the relevance of recommended papers and triage them more confidently when compared to a baseline that presented the related work sections from recommended papers.
- [704] arXiv:2403.02951 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive EvaluationBin Zhang , Yuxiao Ye , Guoqing Du , Xiaoru Hu , Zhishuai Li , Sun Yang , Chi Harold Liu , Rui Zhao , Ziyue Li , Hangyu MaoComments: 26pages, 6figures, 14tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt templates and design frameworks. Additionally, existing benchmarks inadequately explore the performance of LLMs across the various sub-tasks of the Text-to-SQL process, which hinders the assessment of LLMs' cognitive capabilities and the optimization of LLM-based solutions. To address the aforementioned issues, we firstly construct a new dataset designed to mitigate the risk of overfitting in LLMs. Then we formulate five evaluation tasks to comprehensively assess the performance of diverse methods across various LLMs throughout the Text-to-SQL process.Our study highlights the performance disparities among LLMs and proposes optimal in-context learning solutions tailored to each task. These findings offer valuable insights for enhancing the development of LLM-based Text-to-SQL systems.
- [705] arXiv:2403.02959 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SimuCourt: Building Judicial Decision-Making Agents with Real-world Judgement DocumentsZhitao He , Pengfei Cao , Chenhao Wang , Zhuoran Jin , Yubo Chen , Jiexin Xu , Huaijun Li , Xiaojian Jiang , Kang Liu , Jun ZhaoSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With the development of deep learning, natural language processing technology has effectively improved the efficiency of various aspects of the traditional judicial industry. However, most current efforts focus solely on individual judicial stage, overlooking cross-stage collaboration. As the autonomous agents powered by large language models are becoming increasingly smart and able to make complex decisions in real-world settings, offering new insights for judicial intelligence. In this paper, (1) we introduce SimuCourt, a judicial benchmark that encompasses 420 judgment documents from real-world, spanning the three most common types of judicial cases, and a novel task Judicial Decision-Making to evaluate the judicial analysis and decision-making power of agents. To support this task, we construct a large-scale judicial knowledge base, JudicialKB, with multiple legal knowledge. (2) we propose a novel multi-agent framework, AgentsCourt. Our framework follows the real-world classic court trial process, consisting of court debate simulation, legal information retrieval and judgement refinement to simulate the decision-making of judge. (3) we perform extensive experiments, the results demonstrate that, our framework outperforms the existing advanced methods in various aspects, especially in generating legal grounds, where our model achieves significant improvements of 8.6% and 9.1% F1 score in the first and second instance settings, respectively.
- [706] arXiv:2403.02965 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ChatGPT and biometrics: an assessment of face recognition, gender detection, and age estimation capabilitiesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores the application of large language models (LLMs), like ChatGPT, for biometric tasks. We specifically examine the capabilities of ChatGPT in performing biometric-related tasks, with an emphasis on face recognition, gender detection, and age estimation. Since biometrics are considered as sensitive information, ChatGPT avoids answering direct prompts, and thus we crafted a prompting strategy to bypass its safeguard and evaluate the capabilities for biometrics tasks. Our study reveals that ChatGPT recognizes facial identities and differentiates between two facial images with considerable accuracy. Additionally, experimental results demonstrate remarkable performance in gender detection and reasonable accuracy for the age estimation tasks. Our findings shed light on the promising potentials in the application of LLMs and foundation models for biometrics.
- [707] arXiv:2403.02966 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question AnsweringSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent studies have investigated utilizing Knowledge Graphs (KGs) to enhance Quesetion Answering (QA) performance of Large Language Models (LLMs), yet structured KG verbalization remains challengin. Existing methods, such as triple-form or free-form textual conversion of triple-form facts, encounter several issues. These include reduced evidence density due to duplicated entities or relationships, and reduced evidence clarity due to an inability to emphasize crucial evidence. To address these issues, we propose EFSum, an Evidence-focused Fact Summarization framework for enhanced QA with knowledge-augmented LLMs. We optimize an open-source LLM as a fact summarizer through distillation and preference alignment. Our extensive experiments show that EFSum improves LLM's zero-shot QA performance, and it is possible to ensure both the helpfulness and faithfulness of the summary.
- [708] arXiv:2403.02975 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic MatchingComments: arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submissionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Sentence semantic matching is a research hotspot in natural language processing, which is considerably significant in various key scenarios, such as community question answering, searching, chatbot, and recommendation. Since most of the advanced models directly model the semantic relevance among words between two sentences while neglecting the \textit{keywords} and \textit{intents} concepts of them, DC-Match is proposed to disentangle keywords from intents and utilizes them to optimize the matching performance. Although DC-Match is a simple yet effective method for semantic matching, it highly depends on the external NER techniques to identify the keywords of sentences, which limits the performance of semantic matching for minor languages since satisfactory NER tools are usually hard to obtain. In this paper, we propose to generally and flexibly resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models. To this end, we devise a \underline{M}ulti-\underline{C}oncept \underline{P}arsed \underline{S}emantic \underline{M}atching framework based on the pre-trained language models, abbreviated as \textbf{MCP-SM}, to extract various concepts and infuse them into the classification tokens. We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM. Besides, we experiment on Arabic datasets MQ2Q and XNLI, the outstanding performance further prove MCP-SM's applicability in low-resource languages.
- [709] arXiv:2403.02983 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Federated Learning Under Attack: Exposing Vulnerabilities through Data Poisoning Attacks in Computer NetworksSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Abstract: Federated Learning (FL) is a machine learning (ML) approach that enables multiple decentralized devices or edge servers to collaboratively train a shared model without exchanging raw data. During the training and sharing of model updates between clients and servers, data and models are susceptible to different data-poisoning attacks.
In this study, our motivation is to explore the severity of data poisoning attacks in the computer network domain because they are easy to implement but difficult to detect. We considered two types of data-poisoning attacks, label flipping (LF) and feature poisoning (FP), and applied them with a novel approach. In LF, we randomly flipped the labels of benign data and trained the model on the manipulated data. For FP, we randomly manipulated the highly contributing features determined using the Random Forest algorithm. The datasets used in this experiment were CIC and UNSW related to computer networks. We generated adversarial samples using the two attacks mentioned above, which were applied to a small percentage of datasets. Subsequently, we trained and tested the accuracy of the model on adversarial datasets. We recorded the results for both benign and manipulated datasets and observed significant differences between the accuracy of the models on different datasets. From the experimental results, it is evident that the LF attack failed, whereas the FP attack showed effective results, which proved its significance in fooling a server. With a 1% LF attack on the CIC, the accuracy was approximately 0.0428 and the ASR was 0.9564; hence, the attack is easily detectable, while with a 1% FP attack, the accuracy and ASR were both approximately 0.9600, hence, FP attacks are difficult to detect. We repeated the experiment with different poisoning percentages. - [710] arXiv:2403.02990 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and ChallengesBosheng Ding , Chengwei Qin , Ruochen Zhao , Tianze Luo , Xinze Li , Guizhen Chen , Wenhan Xia , Junjie Hu , Anh Tuan Luu , Shafiq JotySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In the rapidly evolving field of machine learning (ML), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of Large Language Models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From a data perspective and a learning perspective, we examine various strategies that utilize Large Language Models for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for further training. Additionally, this paper delineates the primary challenges faced in this domain, ranging from controllable data augmentation to multi modal data augmentation. This survey highlights the paradigm shift introduced by LLMs in DA, aims to serve as a foundational guide for researchers and practitioners in this field.
- [711] arXiv:2403.02995 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Mitigating Label Flipping Attacks in Malicious URL Detectors Using Ensemble TreesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Abstract: Malicious URLs provide adversarial opportunities across various industries, including transportation, healthcare, energy, and banking which could be detrimental to business operations. Consequently, the detection of these URLs is of crucial importance; however, current Machine Learning (ML) models are susceptible to backdoor attacks. These attacks involve manipulating a small percentage of training data labels, such as Label Flipping (LF), which changes benign labels to malicious ones and vice versa. This manipulation results in misclassification and leads to incorrect model behavior. Therefore, integrating defense mechanisms into the architecture of ML models becomes an imperative consideration to fortify against potential attacks.
The focus of this study is on backdoor attacks in the context of URL detection using ensemble trees. By illuminating the motivations behind such attacks, highlighting the roles of attackers, and emphasizing the critical importance of effective defense strategies, this paper contributes to the ongoing efforts to fortify ML models against adversarial threats within the ML domain in network security. We propose an innovative alarm system that detects the presence of poisoned labels and a defense mechanism designed to uncover the original class labels with the aim of mitigating backdoor attacks on ensemble tree classifiers. We conducted a case study using the Alexa and Phishing Site URL datasets and showed that LF attacks can be addressed using our proposed defense mechanism. Our experimental results prove that the LF attack achieved an Attack Success Rate (ASR) between 50-65% within 2-5%, and the innovative defense method successfully detected poisoned labels with an accuracy of up to 100%. - [712] arXiv:2403.03002 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Mem-elements based Neuromorphic Hardware for Neural Network ApplicationComments: Master's ThesisSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: The thesis investigates the utilization of memristive and memcapacitive crossbar arrays in low-power machine learning accelerators, offering a comprehensive co-design framework for deep neural networks (DNN). The model, implemented through a hybrid Python and PyTorch approach, accounts for various non-idealities, achieving exceptional training accuracies of 90.02% and 91.03% for the CIFAR-10 dataset with memristive and memcapacitive crossbar arrays on an 8-layer VGG network. Additionally, the thesis introduces a novel approach to emulate meminductor devices using Operational Transconductance Amplifiers (OTA) and capacitors, showcasing adjustable behavior. Transistor-level simulations in 180 nm CMOS technology, operating at 60 MHz, demonstrate the proposed meminductor emulator's viability with a power consumption of 0.337 mW. The design is further validated in neuromorphic circuits and CNN accelerators, achieving training and testing accuracies of 91.04% and 88.82%, respectively. Notably, the exclusive use of MOS transistors ensures the feasibility of monolithic IC fabrication. This research significantly contributes to the exploration of advanced hardware solutions for efficient and high-performance machine-learning applications.
- [713] arXiv:2403.03020 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SplAgger: Split Aggregation for Meta-Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A core ambition of reinforcement learning (RL) is the creation of agents capable of rapid learning in novel tasks. Meta-RL aims to achieve this by directly learning such agents. Black box methods do so by training off-the-shelf sequence models end-to-end. By contrast, task inference methods explicitly infer a posterior distribution over the unknown task, typically using distinct objectives and sequence models designed to enable task inference. Recent work has shown that task inference methods are not necessary for strong performance. However, it remains unclear whether task inference sequence models are beneficial even when task inference objectives are not. In this paper, we present strong evidence that task inference sequence models are still beneficial. In particular, we investigate sequence models with permutation invariant aggregation, which exploit the fact that, due to the Markov property, the task posterior does not depend on the order of data. We empirically confirm the advantage of permutation invariant sequence models without the use of task inference objectives. However, we also find, surprisingly, that there are multiple conditions under which permutation variance remains useful. Therefore, we propose SplAgger, which uses both permutation variant and invariant components to achieve the best of both worlds, outperforming all baselines on continuous control and memory environments.
- [714] arXiv:2403.03030 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Unifying Controller Design for Stabilizing Nonlinear Systems with Norm-Bounded Control InputsSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: This paper revisits a classical challenge in the design of stabilizing controllers for nonlinear systems with a norm-bounded input constraint. By extending Lin-Sontag's universal formula and introducing a generic (state-dependent) scaling term, a unifying controller design method is proposed. The incorporation of this generic scaling term gives a unified controller and enables the derivation of alternative universal formulas with various favorable properties, which makes it suitable for tailored control designs to meet specific requirements and provides versatility across different control scenarios. Additionally, we present a constructive approach to determine the optimal scaling term, leading to an explicit solution to an optimization problem, named optimization-based universal formula. The resulting controller ensures asymptotic stability, satisfies a norm-bounded input constraint, and optimizes a predefined cost function. Finally, the essential properties of the unified controllers are analyzed, including smoothness, continuity at the origin, stability margin, and inverse optimality. Simulations validate the approach, showcasing its effectiveness in addressing a challenging stabilizing control problem of a nonlinear system.
- [715] arXiv:2403.03053 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Neural Codebook Design for Network Beam ManagementComments: To be submitted to IEEE Transactions on Wireless CommunicationsSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Abstract: Obtaining accurate and timely channel state information (CSI) is a fundamental challenge for large antenna systems. Mobile systems like 5G use a beam management framework that joins the initial access, beamforming, CSI acquisition, and data transmission. The design of codebooks for these stages, however, is challenging due to their interrelationships, varying array sizes, and site-specific channel and user distributions. Furthermore, beam management is often focused on single-sector operations while ignoring the overarching network- and system-level optimization. In this paper, we proposed an end-to-end learned codebook design algorithm, network beamspace learning (NBL), that captures and optimizes codebooks to mitigate interference while maximizing the achievable performance with extremely large hybrid arrays. The proposed algorithm requires limited shared information yet designs codebooks that outperform traditional codebooks by over 10dB in beam alignment and achieve more than 25% improvements in network spectral efficiency.
- [716] arXiv:2403.03082 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Recall-Oriented Continual Learning with Generative Adversarial Meta-ModelComments: Accepted in AAAI-2024 (Oral presentation)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The stability-plasticity dilemma is a major challenge in continual learning, as it involves balancing the conflicting objectives of maintaining performance on previous tasks while learning new tasks. In this paper, we propose the recall-oriented continual learning framework to address this challenge. Inspired by the human brain's ability to separate the mechanisms responsible for stability and plasticity, our framework consists of a two-level architecture where an inference network effectively acquires new knowledge and a generative network recalls past knowledge when necessary. In particular, to maximize the stability of past knowledge, we investigate the complexity of knowledge depending on different representations, and thereby introducing generative adversarial meta-model (GAMM) that incrementally learns task-specific parameters instead of input data samples of the task. Through our experiments, we show that our framework not only effectively learns new knowledge without any disruption but also achieves high stability of previous knowledge in both task-aware and task-agnostic learning scenarios. Our code is available at: this https URL .
- [717] arXiv:2403.03089 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: VQSynery: Robust Drug Synergy Prediction With Vector Quantization MechanismSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The pursuit of optimizing cancer therapies is significantly advanced by the accurate prediction of drug synergy. Traditional methods, such as clinical trials, are reliable yet encumbered by extensive time and financial demands. The emergence of high-throughput screening and computational innovations has heralded a shift towards more efficient methodologies for exploring drug interactions. In this study, we present VQSynergy, a novel framework that employs the Vector Quantization (VQ) mechanism, integrated with gated residuals and a tailored attention mechanism, to enhance the precision and generalizability of drug synergy predictions. Our findings demonstrate that VQSynergy surpasses existing models in terms of robustness, particularly under Gaussian noise conditions, highlighting its superior performance and utility in the complex and often noisy domain of drug synergy research. This study underscores the potential of VQSynergy in revolutionizing the field through its advanced predictive capabilities, thereby contributing to the optimization of cancer treatment strategies.
- [718] arXiv:2403.03100 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion ModelsZeqian Ju , Yuancheng Wang , Kai Shen , Xu Tan , Detai Xin , Dongchao Yang , Yanqing Liu , Yichong Leng , Kaitao Song , Siliang Tang , Zhizheng Wu , Tao Qin , Xiang-Yang Li , Wei Ye , Shikun Zhang , Jiang Bian , Lei He , Jinyu Li , Sheng ZhaoComments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot waySubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
- [719] arXiv:2403.03101 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: KnowAgent: Knowledge-Augmented Planning for LLM-Based AgentsYuqi Zhu , Shuofei Qiao , Yixin Ou , Shumin Deng , Ningyu Zhang , Shiwei Lyu , Yue Shen , Lei Liang , Jinjie Gu , Huajun ChenSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in terms of planning hallucinations mitigation. Code is available in this https URL .
- [720] arXiv:2403.03102 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: "In Dialogues We Learn": Towards Personalized Dialogue Without Pre-defined Profiles through In-Dialogue LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Personalized dialogue systems have gained significant attention in recent years for their ability to generate responses in alignment with different personas. However, most existing approaches rely on pre-defined personal profiles, which are not only time-consuming and labor-intensive to create but also lack flexibility. We propose In-Dialogue Learning (IDL), a fine-tuning framework that enhances the ability of pre-trained large language models to leverage dialogue history to characterize persona for completing personalized dialogue generation tasks without pre-defined profiles. Our experiments on three datasets demonstrate that IDL brings substantial improvements, with BLEU and ROUGE scores increasing by up to 200% and 247%, respectively. Additionally, the results of human evaluations further validate the efficacy of our proposed method.
- [721] arXiv:2403.03111 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improved LiDAR Odometry and Mapping using Deep Semantic Segmentation and Novel Outliers DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Perception is a key element for enabling intelligent autonomous navigation. Understanding the semantics of the surrounding environment and accurate vehicle pose estimation are essential capabilities for autonomous vehicles, including self-driving cars and mobile robots that perform complex tasks. Fast moving platforms like self-driving cars impose a hard challenge for localization and mapping algorithms. In this work, we propose a novel framework for real-time LiDAR odometry and mapping based on LOAM architecture for fast moving platforms. Our framework utilizes semantic information produced by a deep learning model to improve point-to-line and point-to-plane matching between LiDAR scans and build a semantic map of the environment, leading to more accurate motion estimation using LiDAR data. We observe that including semantic information in the matching process introduces a new type of outlier matches to the process, where matching occur between different objects of the same semantic class. To this end, we propose a novel algorithm that explicitly identifies and discards potential outliers in the matching process. In our experiments, we study the effect of improving the matching process on the robustness of LiDAR odometry against high speed motion. Our experimental evaluations on KITTI dataset demonstrate that utilizing semantic information and rejecting outliers significantly enhance the robustness of LiDAR odometry and mapping when there are large gaps between scan acquisition poses, which is typical for fast moving platforms.
- [722] arXiv:2403.03114 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Equilibria in Two-Stage Facility Location with Atomic ClientsSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: We consider competitive facility location as a two-stage multi-agent system with two types of clients. For a given host graph with weighted clients on the vertices, first facility agents strategically select vertices for opening their facilities. Then, the clients strategically select which of the opened facilities in their neighborhood to patronize. Facilities want to attract as much client weight as possible, clients want to minimize congestion on the chosen facility.
All recently studied versions of this model assume that clients can split their weight strategically. We consider clients with unsplittable weights, but allow mixed strategies. So clients may randomize over which facility to patronize. Besides modeling a natural client behavior, this subtle change yields drastic changes, e.g., for a given facility placement, qualitatively different client equilibria are possible.
As our main result, we show that pure subgame perfect equilibria always exist if all client weights are identical. For this, we use a novel potential function argument, employing a hierarchical classification of the clients and sophisticated rounding in each step. In contrast, for non-identical clients, we show that deciding the existence of even approximately stable states is computationally intractable. On the positive side, we give a tight bound of 2 on the price of anarchy which implies high social welfare of equilibria, if they exist. - [723] arXiv:2403.03134 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Simplicity in Complexity : Explaining Visual Complexity using Deep Segmentation ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Abstract: The complexity of visual stimuli plays an important role in many cognitive phenomena, including attention, engagement, memorability, time perception and aesthetic evaluation. Despite its importance, complexity is poorly understood and ironically, previous models of image complexity have been quite complex. There have been many attempts to find handcrafted features that explain complexity, but these features are usually dataset specific, and hence fail to generalise. On the other hand, more recent work has employed deep neural networks to predict complexity, but these models remain difficult to interpret, and do not guide a theoretical understanding of the problem. Here we propose to model complexity using segment-based representations of images. We use state-of-the-art segmentation models, SAM and FC-CLIP, to quantify the number of segments at multiple granularities, and the number of classes in an image respectively. We find that complexity is well-explained by a simple linear model with these two features across six diverse image-sets of naturalistic scene and art images. This suggests that the complexity of images can be surprisingly simple.
- [724] arXiv:2403.03154 (cross-list from physics.comp-ph) [ pdf , ps , html , other ]
-
Title: Quantum Many-Body Physics Calculations with Large Language ModelsHaining Pan , Nayantara Mudur , Will Taranto , Maria Tikhanovskaya , Subhashini Venugopalan , Yasaman Bahri , Michael P. Brenner , Eun-Ah KimComments: 9 pages, 4 figures. Supplemental material in the source fileSubjects: Computational Physics (physics.comp-ph) ; Other Condensed Matter (cond-mat.other); Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method, requiring an analytic multi-step calculation deriving approximate Hamiltonian and corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it can correctly derive the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in 2 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, the requisite skill for doing these calculations is at the graduate level in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases. The strong performance is the first step for developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.
- [725] arXiv:2403.03168 (cross-list from math.NA) [ pdf , ps , html , other ]
-
Title: Learning Explicitly Conditioned Sparsifying TransformsSubjects: Numerical Analysis (math.NA) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract: Sparsifying transforms became in the last decades widely known tools for finding structured sparse representations of signals in certain transform domains. Despite the popularity of classical transforms such as DCT and Wavelet, learning optimal transforms that guarantee good representations of data into the sparse domain has been recently analyzed in a series of papers. Typically, the conditioning number and representation ability are complementary key features of learning square transforms that may not be explicitly controlled in a given optimization model. Unlike the existing approaches from the literature, in our paper, we consider a new sparsifying transform model that enforces explicit control over the data representation quality and the condition number of the learned transforms. We confirm through numerical experiments that our model presents better numerical behavior than the state-of-the-art.
- [726] arXiv:2403.03170 (cross-list from cs.MM) [ pdf , ps , html , other ]
-
Title: SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation DetectionComments: To appear in CVPR 2024Subjects: Multimedia (cs.MM) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Abstract: Misinformation is a prevalent societal issue due to its potential high risks. Out-of-context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which is essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering the subtle crossmodal differences. In this paper, we introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. SNIFFER employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities and the second stage leverages language-only GPT-4 generated OOC-specific instruction data to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval, SNIFFER not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that SNIFFER surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. SNIFFER also provides accurate and persuasive explanations as validated by quantitative and human evaluations.
- [727] arXiv:2403.03174 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual PromptingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Open-vocabulary generalization requires robotic systems to perform tasks involving complex and diverse environments and task goals. While the recent advances in vision language models (VLMs) present unprecedented opportunities to solve unseen problems, how to utilize their emergent capabilities to control robots in the physical world remains an open question. In this paper, we present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language descriptions. At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world. By prompting a VLM pre-trained on Internet-scale data, our approach predicts the affordances and generates the corresponding motions by leveraging the concept understanding and commonsense knowledge from broad sources. To scaffold the VLM's reasoning in zero-shot, we propose a visual prompting technique that annotates marks on the images, converting the prediction of keypoints and waypoints into a series of visual question answering problems that are feasible for the VLM to solve. Using the robot experiences collected in this way, we further investigate ways to bootstrap the performance through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions, such as tool use, deformable body manipulation, and object rearrangement.
- [728] arXiv:2403.03181 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Behavior Generation with Latent ActionsSeungjae Lee , Yibin Wang , Haritheja Etukuru , H. Jin Kim , Nur Muhammad Mahi Shafiullah , Lerrel PintoComments: Github repo: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Generative modeling of complex behaviors from labeled datasets has been a longstanding problem in decision making. Unlike language or image generation, decision making requires modeling actions - continuous-valued vectors that are multimodal in their distribution, potentially drawn from uncurated sources, where generation errors can compound in sequential prediction. A recent class of models called Behavior Transformers (BeT) addresses this by discretizing actions using k-means clustering to capture different modes. However, k-means struggles to scale for high-dimensional action spaces or long sequences, and lacks gradient information, and thus BeT suffers in modeling long-range actions. In this work, we present Vector-Quantized Behavior Transformer (VQ-BeT), a versatile model for behavior generation that handles multimodal action prediction, conditional generation, and partial observations. VQ-BeT augments BeT by tokenizing continuous actions with a hierarchical vector quantization module. Across seven environments including simulated manipulation, autonomous driving, and robotics, VQ-BeT improves on state-of-the-art models such as BeT and Diffusion Policies. Importantly, we demonstrate VQ-BeT's improved ability to capture behavior modes while accelerating inference speed 5x over Diffusion Policies. Videos and code can be found this https URL
- [729] arXiv:2403.03183 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: How Well Can Transformers Emulate In-context Newton's Method?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher order optimization methods, beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with only a logarithmic to the error more layers. As a by-product we demonstrate the ability of even linear attention-only Transformers in implementing a single step of Newton's iteration for matrix inversion with merely two layers. These results suggest the ability of the Transformer architecture to implement complex algorithms, beyond gradient descent.
- [730] arXiv:2403.03185 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Preventing Reward Hacking with Occupancy Measure RegularizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Reward hacking occurs when an agent performs very well with respect to a "proxy" reward function (which may be hand-specified or learned), but poorly with respect to the unknown true reward. Since ensuring good alignment between the proxy and true reward is extremely difficult, one approach to prevent reward hacking is optimizing the proxy conservatively. Prior work has particularly focused on enforcing the learned policy to behave similarly to a "safe" policy by penalizing the KL divergence between their action distributions (AD). However, AD regularization doesn't always work well since a small change in action distribution at a single state can lead to potentially calamitous outcomes, while large changes might not be indicative of any dangerous activity. Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD divergence to prevent reward hacking. We theoretically establish that OM regularization can more effectively avoid large drops in true reward. Then, we empirically demonstrate in a variety of realistic environments that OM divergence is superior to AD divergence for preventing reward hacking by regularizing towards a safe policy. Furthermore, we show that occupancy measure divergence can also regularize learned policies away from reward hacking behavior. Our code and data are available at this https URL
- [731] arXiv:2403.03187 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reliable, Adaptable, and Attributable Language Models with RetrievalAkari Asai , Zexuan Zhong , Danqi Chen , Pang Wei Koh , Luke Zettlemoyer , Hannaneh Hajishirzi , Wen-tau YihSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Parametric language models (LMs), which are trained on vast amounts of web data, exhibit remarkable flexibility and capability. However, they still face practical challenges such as hallucinations, difficulty in adapting to new data distributions, and a lack of verifiability. In this position paper, we advocate for retrieval-augmented LMs to replace parametric LMs as the next generation of LMs. By incorporating large-scale datastores during inference, retrieval-augmented LMs can be more reliable, adaptable, and attributable. Despite their potential, retrieval-augmented LMs have yet to be widely adopted due to several obstacles: specifically, current retrieval-augmented LMs struggle to leverage helpful text beyond knowledge-intensive tasks such as question answering, have limited interaction between retrieval and LM components, and lack the infrastructure for scaling. To address these, we propose a roadmap for developing general-purpose retrieval-augmented LMs. This involves a reconsideration of datastores and retrievers, the exploration of pipelines with improved retriever-LM interaction, and significant investment in infrastructure for efficient training and inference.
- [732] arXiv:2403.03218 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The WMDP Benchmark: Measuring and Reducing Malicious Use With UnlearningNathaniel Li , Alexander Pan , Anjali Gopal , Summer Yue , Daniel Berrios , Alice Gatti , Justin D. Li , Ann-Kathrin Dombrowski , Shashwat Goel , Long Phan , Gabriel Mukobi , Nathan Helm-Burger , Rassin Lababidi , Lennart Justen , Andrew B. Liu , Michael Chen , Isabelle Barrass , Oliver Zhang , Xiaoyuan Zhu , Rishub Tamirisa , Bhrugu Bharathi , Adam Khoja , Zhenqi Zhao , Ariel Herbert-Voss , Cort B. Breuer , Samuel Marks , Oam Patel , Andy Zou , Mantas Mazeika , Zifan Wang , Palash Oswal , Weiran Liu , Adam A. Hunt , Justin Tienken-Harder , Kevin Y. Shih , Kemper Talley , John Guan , Russell Kaplan , Ian Steneker , David Campbell , Brad Jokubaitis , Alex Levinson , Jean Wang , William Qian , Kallol Krishna Karmakar , Steven Basart , Stephen Fitz , Mindy Levine , Ponnurangam Kumaraguru , Uday Tupakula , Vijay Varadharajan , Ruoyu Wang , Yan Shoshitaishvili , Jimmy Ba , Kevin M. Esvelt , Alexandr Wang , Dan HendrycksComments: See the project page at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at this https URL
- [733] arXiv:2403.03222 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Knowledge-guided EEG Representation LearningComments: 6 Pages, 5 figures, Submitted to EMBC 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Self-supervised learning has produced impressive results in multimedia domains of audio, vision and speech. This paradigm is equally, if not more, relevant for the domain of biosignals, owing to the scarcity of labelled data in such scenarios. The ability to leverage large-scale unlabelled data to learn robust representations could help improve the performance of numerous inference tasks on biosignals. Given the inherent domain differences between multimedia modalities and biosignals, the established objectives for self-supervised learning may not translate well to this domain. Hence, there is an unmet need to adapt these methods to biosignal analysis. In this work we propose a self-supervised model for EEG, which provides robust performance and remarkable parameter efficiency by using state space-based deep learning architecture. We also propose a novel knowledge-guided pre-training objective that accounts for the idiosyncrasies of the EEG signal. The results indicate improved embedding representation learning and downstream performance compared to prior works on exemplary tasks. Also, the proposed objective significantly reduces the amount of pre-training data required to obtain performance equivalent to prior works.
- [734] arXiv:2403.03224 (cross-list from physics.soc-ph) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning Jazz Improvisation: When Music Meets Game TheoryComments: 16 pages, 4 figuresSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Live performances of music are always charming, with the unpredictability of improvisation due to the dynamic between musicians and interactions with the audience. Jazz improvisation is a particularly noteworthy example for further investigation from a theoretical perspective. Here, we introduce a novel mathematical game theory model for jazz improvisation, providing a framework for studying music theory and improvisational methodologies. We use computational modeling, mainly reinforcement learning, to explore diverse stochastic improvisational strategies and their paired performance on improvisation. We find that the most effective strategy pair is a strategy that reacts to the most recent payoff (Stepwise Changes) with a reinforcement learning strategy limited to notes in the given chord (Chord-Following Reinforcement Learning). Conversely, a strategy that reacts to the partner's last note and attempts to harmonize with it (Harmony Prediction) strategy pair yields the lowest non-control payoff and highest standard deviation, indicating that picking notes based on immediate reactions to the partner player can yield inconsistent outcomes. On average, the Chord-Following Reinforcement Learning strategy demonstrates the highest mean payoff, while Harmony Prediction exhibits the lowest. Our work lays the foundation for promising applications beyond jazz: including the use of artificial intelligence (AI) models to extract data from audio clips to refine musical reward systems, and training machine learning (ML) models on existing jazz solos to further refine strategies within the game.
- [735] arXiv:2403.03230 (cross-list from q-bio.NC) [ pdf , ps , html , other ]
-
Title: Large language models surpass human experts in predicting neuroscience resultsXiaoliang Luo , Akilles Rechardt , Guangzhi Sun , Kevin K. Nejad , Felipe Yáñez , Bati Yilmaz , Kangjoo Lee , Alexandra O. Cohen , Valentina Borghesani , Anton Pashkov , Daniele Marinazzo , Jonathan Nicholas , Alessandro Salatiello , Ilia Sucholutsky , Pasquale Minervini , Sepehr Razavi , Roberta Rocca , Elkhan Yusifov , Tereza Okalova , Nianlong Gu , Martin Ferianc , Mikail Khona , Kaustubh R. Patil , Pui-Shee Lee , Rui Mata , Nicholas E. Myers , Jennifer K Bizley , Sebastian Musslick , Isil Poyraz Bilgin , Guiomar Niso , Justin M. Ales , Michael Gaebler , N Apurva Ratan Murty , Leyla Loued-Khenissi , Anna Behler , Chloe M. Hall , Jessica Dafflon , Sherry Dongqi Bao , Bradley C. LoveSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI)
Abstract: Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs were confident in their predictions, they were more likely to be correct, which presages a future where humans and LLMs team together to make discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.
- [736] arXiv:2403.03239 (cross-list from physics.soc-ph) [ pdf , ps , other ]
-
Title: Note: Harnessing Tellurium Nanoparticles in the Digital Realm Plasmon Resonance, in the Context of Brewster's Angle and the Drude Model for Fake News Adsorption in Incomplete Information GamesComments: Tellurium Nanoparticles, Snell's Law, Soliton Solution, Anamorphic Surfaces, Nonlinear Dynamics, Fake News Adsorption, User Behavior Modeling, Health Improvement Strategies, Plasmonic Sensors This paper is partially an attempt to utilize "Generative AI" and was written with educational intent. There are currently no plans for it to become a peer-reviewed paperSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI)
Abstract: This note explores the innovative application of soliton theory and plasmonic phenomena in modeling user behavior and engagement within digital health platforms. By introducing the concept of soliton solutions, we present a novel approach to understanding stable patterns of health improvement behaviors over time. Additionally, we delve into the role of tellurium nanoparticles and their plasmonic properties in adsorbing fake news, thereby influencing user interactions and engagement levels. Through a theoretical framework that combines nonlinear dynamics with the unique characteristics of tellurium nanoparticles, we aim to provide new insights into the dynamics of user engagement in digital health environments. Our analysis highlights the potential of soliton theory in capturing the complex, nonlinear dynamics of user behavior, while the application of plasmonic phenomena offers a promising avenue for enhancing the sensitivity and effectiveness of digital health platforms. This research ventures into an uncharted territory where optical phenomena such as Brewster's Angle and Snell's Law, along with the concept of spin solitons, are metaphorically applied to address the challenge of fake news dissemination. By exploring the analogy between light refraction, reflection, and the propagation of information in digital platforms, we unveil a novel perspective on how the 'angle' at which information is presented can significantly affect its acceptance and spread. Additionally, we propose the use of tellurium nanoparticles to manage 'information waves' through mechanisms akin to plasmonic resonance and soliton dynamics. This theoretical exploration aims to bridge the gap between physical sciences and digital communication, offering insights into the development of strategies for mitigating misinformation.
- [737] arXiv:2403.03274 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: From Noise to Signal: Unveiling Treatment Effects from Digital Health Data through Pharmacology-Informed Neural-SDEComments: 6 figuresSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Abstract: Digital health technologies (DHT), such as wearable devices, provide personalized, continuous, and real-time monitoring of patient. These technologies are contributing to the development of novel therapies and personalized medicine. Gaining insight from these technologies requires appropriate modeling techniques to capture clinically-relevant changes in disease state. The data generated from these devices is characterized by being stochastic in nature, may have missing elements, and exhibits considerable inter-individual variability - thereby making it difficult to analyze using traditional longitudinal modeling techniques. We present a novel pharmacology-informed neural stochastic differential equation (SDE) model capable of addressing these challenges. Using synthetic data, we demonstrate that our approach is effective in identifying treatment effects and learning causal relationships from stochastic data, thereby enabling counterfactual simulation.
- [738] arXiv:2403.03276 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: ARNN: Attentive Recurrent Neural Network for Multi-channel EEG Signals to Identify Epileptic SeizuresComments: 9 pages, 7 figures, Journal PaperSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We proposed an Attentive Recurrent Neural Network (ARNN), which recurrently applies attention layers along a sequence and has linear complexity with respect to the sequence length. The proposed model operates on multi-channel EEG signals rather than single channel signals and leverages parallel computation. In this cell, the attention layer is a computational unit that efficiently applies self-attention and cross-attention mechanisms to compute a recurrent function over a wide number of state vectors and input signals. Our architecture is inspired in part by the attention layer and long short-term memory (LSTM) cells, and it uses long-short style gates, but it scales this typical cell up by several orders to parallelize for multi-channel EEG signals. It inherits the advantages of attention layers and LSTM gate while avoiding their respective drawbacks. We evaluated the model effectiveness through extensive experiments with heterogeneous datasets, including the CHB-MIT and UPenn and Mayos Clinic, CHB-MIT datasets. The empirical findings suggest that the ARNN model outperforms baseline methods such as LSTM, Vision Transformer (ViT), Compact Convolution Transformer (CCT), and R-Transformer (RT), showcasing superior performance and faster processing capabilities across a wide range of tasks. The code has been made publicly accessible at \url{ this https URL }.
- [739] arXiv:2403.03281 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Credibility-Aware Multi-Modal Fusion Using Probabilistic CircuitsSahil Sidheekh , Pranuthi Tenali , Saurabh Mathur , Erik Blasch , Kristian Kersting , Sriraam NatarajanSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We consider the problem of late multi-modal fusion for discriminative learning. Motivated by noisy, multi-source domains that require understanding the reliability of each data source, we explore the notion of credibility in the context of multi-modal fusion. We propose a combination function that uses probabilistic circuits (PCs) to combine predictive distributions over individual modalities. We also define a probabilistic measure to evaluate the credibility of each modality via inference queries over the PC. Our experimental evaluation demonstrates that our fusion method can reliably infer credibility while maintaining competitive performance with the state-of-the-art.
- [740] arXiv:2403.03305 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Best of Both Worlds: A Pliable and Generalizable Neuro-Symbolic Approach for Relation ClassificationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces a novel neuro-symbolic architecture for relation classification (RC) that combines rule-based methods with contemporary deep learning techniques. This approach capitalizes on the strengths of both paradigms: the adaptability of rule-based systems and the generalization power of neural networks. Our architecture consists of two components: a declarative rule-based model for transparent classification and a neural component to enhance rule generalizability through semantic text matching. Notably, our semantic matcher is trained in an unsupervised domain-agnostic way, solely with synthetic data. Further, these components are loosely coupled, allowing for rule modifications without retraining the semantic matcher. In our evaluation, we focused on two few-shot relation classification datasets: Few-Shot TACRED and a Few-Shot version of NYT29. We show that our proposed method outperforms previous state-of-the-art models in three out of four settings, despite not seeing any human-annotated training data. Further, we show that our approach remains modular and pliable, i.e., the corresponding rules can be locally modified to improve the overall model. Human interventions to the rules for the TACRED relation \texttt{org:parents} boost the performance on that relation by as much as 26\% relative improvement, without negatively impacting the other relations, and without retraining the semantic matching component.
- [741] arXiv:2403.03322 (cross-list from cs.SE) [ pdf , ps , other ]
-
Title: Deep Configuration Performance Learning: A Systematic Survey and TaxonomySubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Performance is arguably the most crucial attribute that reflects the behavior of a configurable software system. However, given the increasing scale and complexity of modern software, modeling and predicting how various configurations can impact performance becomes one of the major challenges in software maintenance. As such, performance is often modeled without having a thorough knowledge of the software system, but relying mainly on data, which fits precisely with the purpose of deep learning.
In this paper, we conduct a comprehensive review exclusively on the topic of deep learning for performance learning of configurable software, covering 948 searched papers spanning six indexing services, based on which 85 primary papers were extracted and analyzed. Our results summarize the key topics and statistics on how the configuration data is prepared; how the deep configuration performance learning model is built; how the model is evaluated and how they are exploited in different tasks related to software configuration. We also identify the good practice and the potentially problematic phenomena from the studies surveyed, together with insights on future opportunities for the field. To promote open science, all the raw results of this survey can be accessed at our repository: this https URL . - [742] arXiv:2403.03334 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DIVERSE: Deciphering Internet Views on the U.S. Military Through Video Comment Stance Analysis, A Novel Benchmark Dataset for Stance ClassificationComments: Paper under review for dataset track of ICWSM 2024. 11 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Stance detection of social media text is a key component of downstream tasks involving the identification of groups of users with opposing opinions on contested topics such as vaccination and within arguments. In particular, stance provides an indication of an opinion towards an entity. This paper introduces DIVERSE, a dataset of over 173,000 YouTube video comments annotated for their stance towards videos of the U.S. military. The stance is annotated through a human-guided, machine-assisted labeling methodology that makes use of weak signals of tone within the sentence as supporting indicators, as opposed to using manual annotations by humans. These weak signals consist of the presence of hate speech and sarcasm, the presence of specific keywords, the sentiment of the text, and the stance inference from two Large Language Models. The weak signals are then consolidated using a data programming model before each comment is annotated with a final stance label. On average, the videos have 200 comments each, and the stance of the comments skews slightly towards the "against" characterization for both the U.S. Army and the videos posted on the channel.
- [743] arXiv:2403.03344 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Learn to Code Sustainably: An Empirical Study on LLM-based Green Code GenerationTina Vartziotis , Ippolyti Dellatolas , George Dasoulas , Maximilian Schmidt , Florian Schneider , Tim Hoffmann , Sotirios Kotsopoulos , Michael KeckeisenSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The increasing use of information technology has led to a significant share of energy consumption and carbon emissions from data centers. These contributions are expected to rise with the growing demand for big data analytics, increasing digitization, and the development of large artificial intelligence (AI) models. The need to address the environmental impact of software development has led to increased interest in green (sustainable) coding and claims that the use of AI models can lead to energy efficiency gains. Here, we provide an empirical study on green code and an overview of green coding practices, as well as metrics used to quantify the sustainability awareness of AI models. In this framework, we evaluate the sustainability of auto-generated code. The auto-generate codes considered in this study are produced by generative commercial AI language models, GitHub Copilot, OpenAI ChatGPT-3, and Amazon CodeWhisperer. Within our methodology, in order to quantify the sustainability awareness of these AI models, we propose a definition of the code's "green capacity", based on certain sustainability metrics. We compare the performance and green capacity of human-generated code and code generated by the three AI language models in response to easy-to-hard problem statements. Our findings shed light on the current capacity of AI models to contribute to sustainable software development.
- [744] arXiv:2403.03348 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Learning to Maximize Mutual Information for Chain-of-Thought DistillationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, marks a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a novel method utilizing chain-of-thought (CoT) distillation, has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model acquires the ability to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To this end, we investigate the mutual relationship of the two tasks from Information Bottleneck perspective and formulate it as maximizing the mutual information of the representation features of the two tasks. We propose a variational approach to solve this optimization problem using a learning-based method. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on language model distillation as well as applications involving CoT. Code and models will be released soon.
- [745] arXiv:2403.03385 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Multi-modal Deep LearningComments: Master's thesisSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This article investigates deep learning methodologies for single-modality clinical data analysis, as a crucial precursor to multi-modal medical research. Building on Guo JingYuan's work, the study refines clinical data processing through Compact Convolutional Transformer (CCT), Patch Up, and the innovative CamCenterLoss technique, establishing a foundation for future multimodal investigations. The proposed methodology demonstrates improved prediction accuracy and at tentiveness to critically ill patients compared to Guo JingYuan's ResNet and StageNet approaches. Novelty that using image-pretrained vision transformer backbone to perform transfer learning time-series clinical data.The study highlights the potential of CCT, Patch Up, and novel CamCenterLoss in processing single modality clinical data within deep learning frameworks, paving the way for future multimodal medical research and promoting precision and personalized healthcare
- [746] arXiv:2403.03395 (cross-list from cs.SD) [ pdf , ps , other ]
-
Title: Interactive Melody Generation System for Enhancing the Creativity of MusiciansSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Abstract: This study proposes a system designed to enumerate the process of collaborative composition among humans, using automatic music composition technology. By integrating multiple Recurrent Neural Network (RNN) models, the system provides an experience akin to collaborating with several composers, thereby fostering diverse creativity. Through dynamic adaptation to the user's creative intentions, based on feedback, the system enhances its capability to generate melodies that align with user preferences and creative needs. The system's effectiveness was evaluated through experiments with composers of varying backgrounds, revealing its potential to facilitate musical creativity and suggesting avenues for further refinement. The study underscores the importance of interaction between the composer and AI, aiming to make music composition more accessible and personalized. This system represents a step towards integrating AI into the creative process, offering a new tool for composition support and collaborative artistic exploration.
- [747] arXiv:2403.03407 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Human vs. Machine: Language Models and WargamesMax Lamparth , Anthony Corso , Jacob Ganz , Oriana Skylar Mastro , Jacquelyn Schneider , Harold TrinkunasSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Wargames have a long history in the development of military strategy and the response of nations to threats or attacks. The advent of artificial intelligence (AI) promises better decision-making and increased military effectiveness. However, there is still debate about how AI systems, especially large language models (LLMs), behave as compared to humans. To this end, we use a wargame experiment with 107 national security expert human players designed to look at crisis escalation in a fictional US-China scenario and compare human players to LLM-simulated responses. We find considerable agreement in the LLM and human responses but also significant quantitative and qualitative differences between simulated and human players in the wargame, motivating caution to policymakers before handing over autonomy or following AI-based strategy recommendations.
- [748] arXiv:2403.03409 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Sparse Spiking Neural Network: Exploiting Heterogeneity in Timescales for Pruning Recurrent SNNComments: Published as a conference paper at ICLR 2024Journal-ref: ICLR 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Recurrent Spiking Neural Networks (RSNNs) have emerged as a computationally efficient and brain-inspired learning model. The design of sparse RSNNs with fewer neurons and synapses helps reduce the computational complexity of RSNNs. Traditionally, sparse SNNs are obtained by first training a dense and complex SNN for a target task, and, then, pruning neurons with low activity (activity-based pruning) while maintaining task performance. In contrast, this paper presents a task-agnostic methodology for designing sparse RSNNs by pruning a large randomly initialized model. We introduce a novel Lyapunov Noise Pruning (LNP) algorithm that uses graph sparsification methods and utilizes Lyapunov exponents to design a stable sparse RSNN from a randomly initialized RSNN. We show that the LNP can leverage diversity in neuronal timescales to design a sparse Heterogeneous RSNN (HRSNN). Further, we show that the same sparse HRSNN model can be trained for different tasks, such as image classification and temporal prediction. We experimentally show that, in spite of being task-agnostic, LNP increases computational efficiency (fewer neurons and synapses) and prediction performance of RSNNs compared to traditional activity-based pruning of trained dense models.
- [749] arXiv:2403.03419 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference OptimizationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have revolutionized the role of AI, yet also pose potential risks of propagating unethical content. Alignment technologies have been introduced to steer LLMs towards human preference, gaining increasing attention. Despite notable breakthroughs in this direction, existing methods heavily rely on high-quality positive-negative training pairs, suffering from noisy labels and the marginal distinction between preferred and dispreferred response data. Given recent LLMs' proficiency in generating helpful responses, this work pivots towards a new research focus: achieving alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness. For this purpose, we propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between the generated responses and the dispreferred ones to effectively eschew harmful information. We theoretically demonstrate that D$^2$O is equivalent to learning a distributional instead of instance-level preference model reflecting human dispreference against the distribution of negative responses. Besides, D$^2$O integrates an implicit Jeffrey Divergence regularization to balance the exploitation and exploration of reference policies and converges to a non-negative one during training. Extensive experiments demonstrate that our method achieves comparable generation quality and surpasses the latest baselines in producing less harmful and more informative responses with better training stability and faster convergence.
- [750] arXiv:2403.03421 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LEAD: Learning Decomposition for Source-free Universal Domain AdaptationSanqing Qu , Tianpei Zou , Lianghua He , Florian Röhrbein , Alois Knoll , Guang Chen , Changjun JiangComments: To appear in CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Universal Domain Adaptation (UniDA) targets knowledge transfer in the presence of both covariate and label shifts. Recently, Source-free Universal Domain Adaptation (SF-UniDA) has emerged to achieve UniDA without access to source data, which tends to be more practical due to data protection policies. The main challenge lies in determining whether covariate-shifted samples belong to target-private unknown categories. Existing methods tackle this either through hand-crafted thresholding or by developing time-consuming iterative clustering strategies. In this paper, we propose a new idea of LEArning Decomposition (LEAD), which decouples features into source-known and -unknown components to identify target-private data. Technically, LEAD initially leverages the orthogonal decomposition analysis for feature decomposition. Then, LEAD builds instance-level decision boundaries to adaptively identify target-private data. Extensive experiments across various UniDA scenarios have demonstrated the effectiveness and superiority of LEAD. Notably, in the OPDA scenario on VisDA dataset, LEAD outperforms GLC by 3.5% overall H-score and reduces 75% time to derive pseudo-labeling decision boundaries. Besides, LEAD is also appealing in that it is complementary to most existing methods. The code is available at this https URL .
- [751] arXiv:2403.03432 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language ModelsComments: 10 pages, COLING24 AcceptedSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Instruction Tuning has the potential to stimulate or enhance specific capabilities of large language models (LLMs). However, achieving the right balance of data is crucial to prevent catastrophic forgetting and interference between tasks. To address these limitations and enhance training flexibility, we propose the Mixture-of-LoRAs (MoA) architecture which is a novel and parameter-efficient tuning method designed for multi-task learning with LLMs. In this paper, we start by individually training multiple domain-specific LoRA modules using corresponding supervised corpus data. These LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE). Subsequently, we combine the multiple LoRAs using an explicit routing strategy and introduce domain labels to facilitate multi-task learning, which help prevent interference between tasks and ultimately enhances the performance of each individual task. Furthermore, each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation. Experiments on diverse tasks demonstrate superior and robust performance, which can further promote the wide application of domain-specific LLMs.
- [752] arXiv:2403.03444 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Uncertainty quantification for deeponets with ensemble kalman inversionComments: 25 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Abstract: In recent years, operator learning, particularly the DeepONet, has received much attention for efficiently learning complex mappings between input and output functions across diverse fields. However, in practical scenarios with limited and noisy data, accessing the uncertainty in DeepONet predictions becomes essential, especially in mission-critical or safety-critical applications. Existing methods, either computationally intensive or yielding unsatisfactory uncertainty quantification, leave room for developing efficient and informative uncertainty quantification (UQ) techniques tailored for DeepONets. In this work, we proposed a novel inference approach for efficient UQ for operator learning by harnessing the power of the Ensemble Kalman Inversion (EKI) approach. EKI, known for its derivative-free, noise-robust, and highly parallelizable feature, has demonstrated its advantages for UQ for physics-informed neural networks [28]. Our innovative application of EKI enables us to efficiently train ensembles of DeepONets while obtaining informative uncertainty estimates for the output of interest. We deploy a mini-batch variant of EKI to accommodate larger datasets, mitigating the computational demand due to large datasets during the training stage. Furthermore, we introduce a heuristic method to estimate the artificial dynamics covariance, thereby improving our uncertainty estimates. Finally, we demonstrate the effectiveness and versatility of our proposed methodology across various benchmark problems, showcasing its potential to address the pressing challenges of uncertainty quantification in DeepONets, especially for practical applications with limited and noisy data.
- [753] arXiv:2403.03456 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DLP-GAN: learning to draw modern Chinese landscape photos with generative adversarial networkComments: Corrected typosJournal-ref: Neural Computing and Applications, 2023: 1-18Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Chinese landscape painting has a unique and artistic style, and its drawing technique is highly abstract in both the use of color and the realistic representation of objects. Previous methods focus on transferring from modern photos to ancient ink paintings. However, little attention has been paid to translating landscape paintings into modern photos. To solve such problems, in this paper, we (1) propose DLP-GAN (Draw Modern Chinese Landscape Photos with Generative Adversarial Network), an unsupervised cross-domain image translation framework with a novel asymmetric cycle mapping, and (2) introduce a generator based on a dense-fusion module to match different translation directions. Moreover, a dual-consistency loss is proposed to balance the realism and abstraction of model painting. In this way, our model can draw landscape photos and sketches in the modern sense. Finally, based on our collection of modern landscape and sketch datasets, we compare the images generated by our model with other benchmarks. Extensive experiments including user studies show that our model outperforms state-of-the-art methods.
- [754] arXiv:2403.03506 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Detecting AI-Generated Sentences in Realistic Human-AI Collaborative Hybrid Texts: Challenges, Strategies, and InsightsZijie Zeng , Shiqi Liu , Lele Sha , Zhuang Li , Kaixun Yang , Sannyuya Liu , Dragan Gašević , Guanliang ChenComments: Accepted as a full paper on IJCAI 2024 (Special Track: AI and Social Good)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This study explores the challenge of sentence-level AI-generated text detection within human-AI collaborative hybrid texts. Existing studies of AI-generated text detection for hybrid texts often rely on synthetic datasets. These typically involve hybrid texts with a limited number of boundaries. We contend that studies of detecting AI-generated content within hybrid texts should cover different types of hybrid texts generated in realistic settings to better inform real-world applications. Therefore, our study utilizes the CoAuthor dataset, which includes diverse, realistic hybrid texts generated through the collaboration between human writers and an intelligent writing system in multi-turn interactions. We adopt a two-step, segmentation-based pipeline: (i) detect segments within a given hybrid text where each segment contains sentences of consistent authorship, and (ii) classify the authorship of each identified segment. Our empirical findings highlight (1) detecting AI-generated sentences in hybrid texts is overall a challenging task because (1.1) human writers' selecting and even editing AI-generated sentences based on personal preferences adds difficulty in identifying the authorship of segments; (1.2) the frequent change of authorship between neighboring sentences within the hybrid text creates difficulties for segment detectors in identifying authorship-consistent segments; (1.3) the short length of text segments within hybrid texts provides limited stylistic cues for reliable authorship determination; (2) before embarking on the detection process, it is beneficial to assess the average length of segments within the hybrid text. This assessment aids in deciding whether (2.1) to employ a text segmentation-based strategy for hybrid texts with longer segments, or (2.2) to adopt a direct sentence-by-sentence classification strategy for those with shorter segments.
- [755] arXiv:2403.03536 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Towards Efficient and Effective Unlearning of Large Language Models for RecommendationComments: 12 pagesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: The significant advancements in large language models (LLMs) give rise to a promising research direction, i.e., leveraging LLMs as recommenders (LLMRec). The efficacy of LLMRec arises from the open-world knowledge and reasoning capabilities inherent in LLMs. LLMRec acquires the recommendation capabilities through instruction tuning based on user interaction data. However, in order to protect user privacy and optimize utility, it is also crucial for LLMRec to intentionally forget specific user data, which is generally referred to as recommendation unlearning. In the era of LLMs, recommendation unlearning poses new challenges for LLMRec in terms of \textit{inefficiency} and \textit{ineffectiveness}. Existing unlearning methods require updating billions of parameters in LLMRec, which is costly and time-consuming. Besides, they always impact the model utility during the unlearning process. To this end, we propose \textbf{E2URec}, the first \underline{E}fficient and \underline{E}ffective \underline{U}nlearning method for LLM\underline{Rec}. Our proposed E2URec enhances the unlearning efficiency by updating only a few additional LoRA parameters, and improves the unlearning effectiveness by employing a teacher-student framework, where we maintain multiple teacher networks to guide the unlearning process. Extensive experiments show that E2URec outperforms state-of-the-art baselines on two real-world datasets. Specifically, E2URec can efficiently forget specific data without affecting recommendation performance. The source code is at \url{ this https URL }.
- [756] arXiv:2403.03538 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: RADIA -- Radio Advertisement Detection with Intelligent AnalyticsJorge Álvarez , Juan Carlos Armenteros , Camilo Torrón , Miguel Ortega-Martín , Alfonso Ardoiz , Óscar García , Ignacio Arranz , Íñigo Galdeano , Ignacio Garrido , Adrián Alonso , Fernando Bayón , Oleg VorontsovSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Abstract: Radio advertising remains an integral part of modern marketing strategies, with its appeal and potential for targeted reach undeniably effective. However, the dynamic nature of radio airtime and the rising trend of multiple radio spots necessitates an efficient system for monitoring advertisement broadcasts. This study investigates a novel automated radio advertisement detection technique incorporating advanced speech recognition and text classification algorithms. RadIA's approach surpasses traditional methods by eliminating the need for prior knowledge of the broadcast content. This contribution allows for detecting impromptu and newly introduced advertisements, providing a comprehensive solution for advertisement detection in radio broadcasting. Experimental results show that the resulting model, trained on carefully segmented and tagged text data, achieves an F1-macro score of 87.76 against a theoretical maximum of 89.33. This paper provides insights into the choice of hyperparameters and their impact on the model's performance. This study demonstrates its potential to ensure compliance with advertising broadcast contracts and offer competitive surveillance. This groundbreaking research could fundamentally change how radio advertising is monitored and open new doors for marketing optimization.
- [757] arXiv:2403.03575 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: gaHealth: An English-Irish Bilingual Corpus of Health DataComments: arXiv admin note: text overlap with arXiv:2403.02367Journal-ref: In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6753-6758, Marseille, France. European Language Resources Association, 2022Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Machine Translation is a mature technology for many high-resource language pairs. However in the context of low-resource languages, there is a paucity of parallel data datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation. The benefits and development of smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English to Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource data sets. gaHealth is now freely available online and is ready to be explored for further research.
- [758] arXiv:2403.03578 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Causal Disentanglement for Regulating Social Influence Bias in Social RecommendationSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: Social recommendation systems face the problem of social influence bias, which can lead to an overemphasis on recommending items that friends have interacted with. Addressing this problem is crucial, and existing methods often rely on techniques such as weight adjustment or leveraging unbiased data to eliminate this bias. However, we argue that not all biases are detrimental, i.e., some items recommended by friends may align with the user's interests. Blindly eliminating such biases could undermine these positive effects, potentially diminishing recommendation accuracy. In this paper, we propose a Causal Disentanglement-based framework for Regulating Social influence Bias in social recommendation, named CDRSB, to improve recommendation performance. From the perspective of causal inference, we find that the user social network could be regarded as a confounder between the user and item embeddings (treatment) and ratings (outcome). Due to the presence of this social network confounder, two paths exist from user and item embeddings to ratings: a non-causal social influence path and a causal interest path. Building upon this insight, we propose a disentangled encoder that focuses on disentangling user and item embeddings into interest and social influence embeddings. Mutual information-based objectives are designed to enhance the distinctiveness of these disentangled embeddings, eliminating redundant information. Additionally, a regulatory decoder that employs a weight calculation module to dynamically learn the weights of social influence embeddings for effectively regulating social influence bias has been designed. Experimental results on four large-scale real-world datasets Ciao, Epinions, Dianping, and Douban book demonstrate the effectiveness of CDRSB compared to state-of-the-art baselines.
- [759] arXiv:2403.03582 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Design of an Open-Source Architecture for Neural Machine TranslationComments: arXiv admin note: substantial text overlap with arXiv:2403.02367Journal-ref: In Proceedings of the 1st Workshop on Open Community-Driven Machine Translation, pages 15-20, Tampere, Finland. European Association for Machine Translation, 2023Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: adaptNMT is an open-source application that offers a streamlined approach to the development and deployment of Recurrent Neural Networks and Transformer models. This application is built upon the widely-adopted OpenNMT ecosystem, and is particularly useful for new entrants to the field, as it simplifies the setup of the development environment and creation of train, validation, and test splits. The application offers a graphing feature that illustrates the progress of model training, and employs SentencePiece for creating subword segmentation models. Furthermore, the application provides an intuitive user interface that facilitates hyperparameter customization. Notably, a single-click model development approach has been implemented, and models developed by adaptNMT can be evaluated using a range of metrics. To encourage eco-friendly research, adaptNMT incorporates a green report that flags the power consumption and kgCO${_2}$ emissions generated during model development. The application is freely available.
- [760] arXiv:2403.03585 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: RouteExplainer: An Explanation Framework for Vehicle Routing ProblemComments: Accepted at PAKDD 2024. This extended version includes more comprehensive explanations and appendicesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: The Vehicle Routing Problem (VRP) is a widely studied combinatorial optimization problem and has been applied to various practical problems. While the explainability for VRP is significant for improving the reliability and interactivity in practical VRP applications, it remains unexplored. In this paper, we propose RouteExplainer, a post-hoc explanation framework that explains the influence of each edge in a generated route. Our framework realizes this by rethinking a route as the sequence of actions and extending counterfactual explanations based on the action influence model to VRP. To enhance the explanation, we additionally propose an edge classifier that infers the intentions of each edge, a loss function to train the edge classifier, and explanation-text generation by Large Language Models (LLMs). We quantitatively evaluate our edge classifier on four different VRPs. The results demonstrate its rapid computation while maintaining reasonable accuracy, thereby highlighting its potential for deployment in practical applications. Moreover, on the subject of a tourist route, we qualitatively evaluate explanations generated by our framework. This evaluation not only validates our framework but also shows the synergy between explanation frameworks and LLMs. See this https URL for our code, datasets, models, and demo.
- [761] arXiv:2403.03592 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Wildest Dreams: Reproducible Research in Privacy-preserving Neural Network TrainingSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Machine Learning (ML), addresses a multitude of complex issues in multiple disciplines, including social sciences, finance, and medical research. ML models require substantial computing power and are only as powerful as the data utilized. Due to high computational cost of ML methods, data scientists frequently use Machine Learning-as-a-Service (MLaaS) to outsource computation to external servers. However, when working with private information, like financial data or health records, outsourcing the computation might result in privacy issues. Recent advances in Privacy-Preserving Techniques (PPTs) have enabled ML training and inference over protected data through the use of Privacy-Preserving Machine Learning (PPML). However, these techniques are still at a preliminary stage and their application in real-world situations is demanding. In order to comprehend discrepancy between theoretical research suggestions and actual applications, this work examines the past and present of PPML, focusing on Homomorphic Encryption (HE) and Secure Multi-party Computation (SMPC) applied to ML. This work primarily focuses on the ML model's training phase, where maintaining user data privacy is of utmost importance. We provide a solid theoretical background that eases the understanding of current approaches and their limitations. In addition, we present a SoK of the most recent PPML frameworks for model training and provide a comprehensive comparison in terms of the unique properties and performances on standard benchmarks. Also, we reproduce the results for some of the papers and examine at what level existing works in the field provide support for open science. We believe our work serves as a valuable contribution by raising awareness about the current gap between theoretical advancements and real-world applications in PPML, specifically regarding open-source availability, reproducibility, and usability.
- [762] arXiv:2403.03593 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Do You Trust Your Model? Emerging Malware Threats in the Deep Learning EcosystemDorjan Hitaj , Giulio Pagnotta , Fabio De Gaspari , Sediola Ruko , Briland Hitaj , Luigi V. Mancini , Fernando Perez-CruzComments: 16 pages, 9 figuresSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Training high-quality deep learning models is a challenging task due to computational and technical requirements. A growing number of individuals, institutions, and companies increasingly rely on pre-trained, third-party models made available in public repositories. These models are often used directly or integrated in product pipelines with no particular precautions, since they are effectively just data in tensor form and considered safe. In this paper, we raise awareness of a new machine learning supply chain threat targeting neural networks. We introduce MaleficNet 2.0, a novel technique to embed self-extracting, self-executing malware in neural networks. MaleficNet 2.0 uses spread-spectrum channel coding combined with error correction techniques to inject malicious payloads in the parameters of deep neural networks. MaleficNet 2.0 injection technique is stealthy, does not degrade the performance of the model, and is robust against removal techniques. We design our approach to work both in traditional and distributed learning settings such as Federated Learning, and demonstrate that it is effective even when a reduced number of bits is used for the model parameters. Finally, we implement a proof-of-concept self-extracting neural network malware using MaleficNet 2.0, demonstrating the practicality of the attack against a widely adopted machine learning framework. Our aim with this work is to raise awareness against these new, dangerous attacks both in the research community and industry, and we hope to encourage further research in mitigation techniques against such threats.
- [763] arXiv:2403.03606 (cross-list from q-fin.CP) [ pdf , ps , other ]
-
Title: Enhancing Price Prediction in Cryptocurrency Using Transformer Neural Network and Technical IndicatorsSubjects: Computational Finance (q-fin.CP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study presents an innovative approach for predicting cryptocurrency time series, specifically focusing on Bitcoin, Ethereum, and Litecoin. The methodology integrates the use of technical indicators, a Performer neural network, and BiLSTM (Bidirectional Long Short-Term Memory) to capture temporal dynamics and extract significant features from raw cryptocurrency data. The application of technical indicators, such facilitates the extraction of intricate patterns, momentum, volatility, and trends. The Performer neural network, employing Fast Attention Via positive Orthogonal Random features (FAVOR+), has demonstrated superior computational efficiency and scalability compared to the traditional Multi-head attention mechanism in Transformer models. Additionally, the integration of BiLSTM in the feedforward network enhances the model's capacity to capture temporal dynamics in the data, processing it in both forward and backward directions. This is particularly advantageous for time series data where past and future data points can influence the current state. The proposed method has been applied to the hourly and daily timeframes of the major cryptocurrencies and its performance has been benchmarked against other methods documented in the literature. The results underscore the potential of the proposed method to outperform existing models, marking a significant progression in the field of cryptocurrency price prediction.
- [764] arXiv:2403.03608 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene UnderstandingComments: Accepted by CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), which uniquely takes image semantics into the synthesis process so that both novel view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual rendering. The former is able to observe multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information, the latter performs both image and semantic rendering with improved performances. Our experiments not only confirm that GSNeRF performs favorably against prior works on both novel-view image and semantic segmentation synthesis but the effectiveness of our sampling strategy for visual rendering is further verified.
- [765] arXiv:2403.03627 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multimodal Large Language Models to Support Real-World Fact-CheckingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here is aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.
- [766] arXiv:2403.03640 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Apollo: An Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B PeopleXidong Wang , Nuo Chen , Junyin Chen , Yan Hu , Yidong Wang , Xiangbo Wu , Anningzhe Gao , Xiang Wan , Haizhou Li , Benyou WangComments: PreprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion. We will open-source training corpora, code, model weights and evaluation benchmark.
- [767] arXiv:2403.03643 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Survey on Applications of Reinforcement Learning in Spatial Resource AllocationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The challenge of spatial resource allocation is pervasive across various domains such as transportation, industry, and daily life. As the scale of real-world issues continues to expand and demands for real-time solutions increase, traditional algorithms face significant computational pressures, struggling to achieve optimal efficiency and real-time capabilities. In recent years, with the escalating computational power of computers, the remarkable achievements of reinforcement learning in domains like Go and robotics have demonstrated its robust learning and sequential decision-making capabilities. Given these advancements, there has been a surge in novel methods employing reinforcement learning to tackle spatial resource allocation problems. These methods exhibit advantages such as rapid solution convergence and strong model generalization abilities, offering a new perspective on resolving spatial resource allocation problems. Therefore, this paper aims to summarize and review recent theoretical methods and applied research utilizing reinforcement learning to address spatial resource allocation problems. It provides a summary and comprehensive overview of its fundamental principles, related methodologies, and applied research. Additionally, it highlights several unresolved issues that urgently require attention in this direction for the future.
- [768] arXiv:2403.03689 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: General2Specialized LLMs Translation for E-commerceKaidi Chen , Ben Chen , Dehong Gao , Huangyu Dai , Wen Jiang , Wei Ning , Shanqing Yu , Libin Yang , Xiaoyan CaiComments: 4 pages, 1 figure, WWW2024 acceptedSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Existing Neural Machine Translation (NMT) models mainly handle translation in the general domain, while overlooking domains with special writing formulas, such as e-commerce and legal documents. Taking e-commerce as an example, the texts usually include amounts of domain-related words and have more grammar problems, which leads to inferior performances of current NMT methods. To address these problems, we collect two domain-related resources, including a set of term pairs (aligned Chinese-English bilingual terms) and a parallel corpus annotated for the e-commerce domain. Furthermore, we propose a two-step fine-tuning paradigm (named G2ST) with self-contrastive semantic enhancement to transfer one general NMT model to the specialized NMT model for e-commerce. The paradigm can be used for the NMT models based on Large language models (LLMs). Extensive evaluations on real e-commerce titles demonstrate the superior translation quality and robustness of our G2ST approach, as compared with state-of-the-art NMT models such as LLaMA, Qwen, GPT-3.5, and even GPT-4.
- [769] arXiv:2403.03690 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on JapaneseComments: COLING 2024. Our code are available here: \href{ this https URL }{self-instruct data} and \href{ this https URL }{evaluation benchmark}Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The creation of instruction data and evaluation benchmarks for serving Large language models often involves enormous human annotation. This issue becomes particularly pronounced when rapidly developing such resources for a non-English language like Japanese. Instead of following the popular practice of directly translating existing English resources into Japanese (e.g., Japanese-Alpaca), we propose an efficient self-instruct method based on GPT-4. We first translate a small amount of English instructions into Japanese and post-edit them to obtain native-level quality. GPT-4 then utilizes them as demonstrations to automatically generate Japanese instruction data. We also construct an evaluation benchmark containing 80 questions across 8 categories, using GPT-4 to automatically assess the response quality of LLMs without human references. The empirical results suggest that the models fine-tuned on our GPT-4 self-instruct data significantly outperformed the Japanese-Alpaca across all three base pre-trained models. Our GPT-4 self-instruct data allowed the LLaMA 13B model to defeat GPT-3.5 (Davinci-003) with a 54.37\% win-rate. The human evaluation exhibits the consistency between GPT-4's assessments and human preference. Our high-quality instruction data and evaluation benchmark have been released here.
- [770] arXiv:2403.03691 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MolNexTR: A Generalized Deep Learning Model for Molecular Image RecognitionComments: Submitted to the Journal of CheminformaticsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the field of chemical structure recognition, the task of converting molecular images into graph structures and SMILES string stands as a significant challenge, primarily due to the varied drawing styles and conventions prevalent in chemical literature. To bridge this gap, we proposed MolNexTR, a novel image-to-graph deep learning model that collaborates to fuse the strengths of ConvNext, a powerful Convolutional Neural Network variant, and Vision-TRansformer. This integration facilitates a more nuanced extraction of both local and global features from molecular images. MolNexTR can predict atoms and bonds simultaneously and understand their layout rules. It also excels at flexibly integrating symbolic chemistry principles to discern chirality and decipher abbreviated structures. We further incorporate a series of advanced algorithms, including improved data augmentation module, image contamination module, and a post-processing module to get the final SMILES output. These modules synergistically enhance the model's robustness against the diverse styles of molecular imagery found in real literature. In our test sets, MolNexTR has demonstrated superior performance, achieving an accuracy rate of 81-97%, marking a significant advancement in the domain of molecular structure recognition. Scientific contribution: MolNexTR is a novel image-to-graph model that incorporates a unique dual-stream encoder to extract complex molecular image features, and combines chemical rules to predict atoms and bonds while understanding atom and bond layout rules. In addition, it employs a series of novel augmentation algorithms to significantly enhance the robustness and performance of the model.
- [771] arXiv:2403.03698 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Controllable Time Series GenerationComments: 14 pages, 13 figures, and 5 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: Time Series Generation (TSG) has emerged as a pivotal technique in synthesizing data that accurately mirrors real-world time series, becoming indispensable in numerous applications. Despite significant advancements in TSG, its efficacy frequently hinges on having large training datasets. This dependency presents a substantial challenge in data-scarce scenarios, especially when dealing with rare or unique conditions. To confront these challenges, we explore a new problem of Controllable Time Series Generation (CTSG), aiming to produce synthetic time series that can adapt to various external conditions, thereby tackling the data scarcity issue.
In this paper, we propose \textbf{C}ontrollable \textbf{T}ime \textbf{S}eries (\textsf{CTS}), an innovative VAE-agnostic framework tailored for CTSG. A key feature of \textsf{CTS} is that it decouples the mapping process from standard VAE training, enabling precise learning of a complex interplay between latent features and external conditions. Moreover, we develop a comprehensive evaluation scheme for CTSG. Extensive experiments across three real-world time series datasets showcase \textsf{CTS}'s exceptional capabilities in generating high-quality, controllable outputs. This underscores its adeptness in seamlessly integrating latent features with external conditions. Extending \textsf{CTS} to the image domain highlights its remarkable potential for explainability and further reinforces its versatility across different modalities. - [772] arXiv:2403.03726 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Diffusion on language model embeddings for protein sequence generationViacheslav Meshchaninov , Pavel Strashnov , Andrey Shevtsov , Fedor Nikolaev , Nikita Ivanisenko , Olga Kardymon , Dmitry VetrovSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Abstract: Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived from the protein language model, ESM-2, to generate amino acid sequences. DiMA surpasses leading solutions, including autoregressive transformer-based and discrete diffusion models, and we quantitatively illustrate the impact of the design choices that lead to its superior performance. We extensively evaluate the quality, diversity, distribution similarity, and biological relevance of the generated sequences using multiple metrics across various modalities. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. This work advances the field of protein design and sets the stage for conditional models by providing a robust framework for scalable and high-quality protein sequence generation.
- [773] arXiv:2403.03728 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-TrainingComments: Accepted at ICLR 2024 Workshop on Practical Machine Learning for Low Resource Settings (PML4LRS)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This study addresses the integration of diversity-based and uncertainty-based sampling strategies in active learning, particularly within the context of self-supervised pre-trained models. We introduce a straightforward heuristic called TCM that mitigates the cold start problem while maintaining strong performance across various data levels. By initially applying TypiClust for diversity sampling and subsequently transitioning to uncertainty sampling with Margin, our approach effectively combines the strengths of both strategies. Our experiments demonstrate that TCM consistently outperforms existing methods across various datasets in both low and high data regimes.
- [774] arXiv:2403.03730 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning 3D object-centric representation through predictionComments: 21 pages, 11 figures. Project webpage can be found at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As part of human core knowledge, the representation of objects is the building block of mental representation that supports high-level concepts and symbolic reasoning. While humans develop the ability of perceiving objects situated in 3D environments without supervision, models that learn the same set of abilities with similar constraints faced by human infants are lacking. Towards this end, we developed a novel network architecture that simultaneously learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth, all while using only information directly available to the brain as training data, namely: sequences of images and self-motion. The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes. This results in object representations being learned as an essential byproduct of learning to predict.
- [775] arXiv:2403.03739 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural NetworkComments: CVPR 2024 AcceptedSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Binary neural networks utilize 1-bit quantized weights and activations to reduce both the model's storage demands and computational burden. However, advanced binary architectures still incorporate millions of inefficient and nonhardware-friendly full-precision multiplication operations. A&B BNN is proposed to directly remove part of the multiplication operations in a traditional BNN and replace the rest with an equal number of bit operations, introducing the mask layer and the quantized RPReLU structure based on the normalizer-free network architecture. The mask layer can be removed during inference by leveraging the intrinsic characteristics of BNN with straightforward mathematical transformations to avoid the associated multiplication operations. The quantized RPReLU structure enables more efficient bit operations by constraining its slope to be integer powers of 2. Experimental results achieved 92.30%, 69.35%, and 66.89% on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively, which are competitive with the state-of-the-art. Ablation studies have verified the efficacy of the quantized RPReLU structure, leading to a 1.14% enhancement on the ImageNet compared to using a fixed slope RLeakyReLU. The proposed add&bit-operation-only BNN offers an innovative approach for hardware-friendly network architecture.
- [776] arXiv:2403.03741 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: SUPClust: Active Learning at the BoundariesComments: Accepted at ICLR 2024 Workshop on Practical Machine Learning for Low Resource Settings (PML4LRS)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Active learning is a machine learning paradigm designed to optimize model performance in a setting where labeled data is expensive to acquire. In this work, we propose a novel active learning method called SUPClust that seeks to identify points at the decision boundary between classes. By targeting these points, SUPClust aims to gather information that is most informative for refining the model's prediction of complex decision regions. We demonstrate experimentally that labeling these points leads to strong model performance. This improvement is observed even in scenarios characterized by strong class imbalance.
- [777] arXiv:2403.03750 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth DatasetComments: 11 pages, 2 figures, 7 tables, conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, May 20-25, 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The advent of Large Language Models (LLMs) has led to remarkable progress on a wide range of natural language processing tasks. Despite the advances, these large-sized models still suffer from hallucinating information in their output, which poses a major issue in automatic text summarization, as we must guarantee that the generated summary is consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e. inconsistency detection) in order to evaluate the faithfulness of the generated summaries. However, these works primarily focus on English and recent multilingual approaches lack German data. This work presents absinth, a manually annotated dataset for hallucination detection in German news summarization and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We open-source and release the absinth dataset to foster further research on hallucination detection in German.
- [778] arXiv:2403.03777 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: ENOT: Expectile Regularization for Fast and Accurate Training of Neural Optimal TransportSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We present a new extension for Neural Optimal Transport (NOT) training procedure, capable of accurately and efficiently estimating optimal transportation plan via specific regularisation on conjugate potentials. The main bottleneck of existing NOT solvers is associated with the procedure of finding a near-exact approximation of the conjugate operator (i.e., the c-transform), which is done either by optimizing over maximin objectives or by the computationally-intensive fine-tuning of the initial approximated prediction. We resolve both issues by proposing a new, theoretically justified loss in the form of expectile regularization that enforces binding conditions on the learning dual potentials. Such a regularization provides the upper bound estimation over the distribution of possible conjugate potentials and makes the learning stable, eliminating the need for additional extensive finetuning. We formally justify the efficiency of our method, called Expectile-Regularised Neural Optimal Transport (ENOT). ENOT outperforms previous state-of-the-art approaches on the Wasserstein-2 benchmark tasks by a large margin (up to a 3-fold improvement in quality and up to a 10-fold improvement in runtime).
- [779] arXiv:2403.03781 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Neural Architecture Search using Particle Swarm and Ant Colony OptimizationJournal-ref: Proceedings of The 28th Irish Conference on Artificial Intelligence and Cognitive Science. 2771. CEUR-WS, 2020Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Neural network models have a number of hyperparameters that must be chosen along with their architecture. This can be a heavy burden on a novice user, choosing which architecture and what values to assign to parameters. In most cases, default hyperparameters and architectures are used. Significant improvements to model accuracy can be achieved through the evaluation of multiple architectures. A process known as Neural Architecture Search (NAS) may be applied to automatically evaluate a large number of such architectures. A system integrating open source tools for Neural Architecture Search (OpenNAS), in the classification of images, has been developed as part of this research. OpenNAS takes any dataset of grayscale, or RBG images, and generates Convolutional Neural Network (CNN) architectures based on a range of metaheuristics using either an AutoKeras, a transfer learning or a Swarm Intelligence (SI) approach. Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) are used as the SI algorithms. Furthermore, models developed through such metaheuristics may be combined using stacking ensembles. In the context of this paper, we focus on training and optimizing CNNs using the Swarm Intelligence (SI) components of OpenNAS. Two major types of SI algorithms, namely PSO and ACO, are compared to see which is more effective in generating higher model accuracies. It is shown, with our experimental design, that the PSO algorithm performs better than ACO. The performance improvement of PSO is most notable with a more complex dataset. As a baseline, the performance of fine-tuned pre-trained models is also evaluated.
- [780] arXiv:2403.03791 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: KG-TREAT: Pre-training for Treatment Effect Estimation by Synergizing Patient Data with Knowledge GraphsComments: AAAI 2024 Main TrackSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Treatment effect estimation (TEE) is the task of determining the impact of various treatments on patient outcomes. Current TEE methods fall short due to reliance on limited labeled data and challenges posed by sparse and high-dimensional observational patient data. To address the challenges, we introduce a novel pre-training and fine-tuning framework, KG-TREAT, which synergizes large-scale observational patient data with biomedical knowledge graphs (KGs) to enhance TEE. Unlike previous approaches, KG-TREAT constructs dual-focus KGs and integrates a deep bi-level attention synergy method for in-depth information fusion, enabling distinct encoding of treatment-covariate and outcome-covariate relationships. KG-TREAT also incorporates two pre-training tasks to ensure a thorough grounding and contextualization of patient data and KGs. Evaluation on four downstream TEE tasks shows KG-TREAT's superiority over existing methods, with an average improvement of 7% in Area under the ROC Curve (AUC) and 9% in Influence Function-based Precision of Estimating Heterogeneous Effects (IF-PEHE). The effectiveness of our estimated treatment effects is further affirmed by alignment with established randomized clinical trial findings.
- [781] arXiv:2403.03808 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Confidence-Aware Decision-Making and Control for Tool SelectionSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Self-reflecting about our performance (e.g., how confident we are) before doing a task is essential for decision making, such as selecting the most suitable tool or choosing the best route to drive. While this form of awareness -- thinking about our performance or metacognitive performance -- is well-known in humans, robots still lack this cognitive ability. This reflective monitoring can enhance their embodied decision power, robustness and safety. Here, we take a step in this direction by introducing a mathematical framework that allows robots to use their control self-confidence to make better-informed decisions. We derive a mathematical closed-form expression for control confidence for dynamic systems (i.e., the posterior inverse covariance of the control action). This control confidence seamlessly integrates within an objective function for decision making, that balances the: i) performance for task completion, ii) control effort, and iii) self-confidence. To evaluate our theoretical account, we framed the decision-making within the tool selection problem, where the agent has to select the best robot arm for a particular control task. The statistical analysis of the numerical simulations with randomized 2DOF arms shows that using control confidence during tool selection improves both real task performance, and the reliability of the tool for performance under unmodelled perturbations (e.g., external forces). Furthermore, our results indicate that control confidence is an early indicator of performance and thus, it can be used as a heuristic for making decisions when computation power is restricted or decision-making is intractable. Overall, we show the advantages of using confidence-aware decision-making and control scheme for dynamic systems.
- [782] arXiv:2403.03812 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ProbSAINT: Probabilistic Tabular Regression for Used Car PricingComments: 9 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Used car pricing is a critical aspect of the automotive industry, influenced by many economic factors and market dynamics. With the recent surge in online marketplaces and increased demand for used cars, accurate pricing would benefit both buyers and sellers by ensuring fair transactions. However, the transition towards automated pricing algorithms using machine learning necessitates the comprehension of model uncertainties, specifically the ability to flag predictions that the model is unsure about. Although recent literature proposes the use of boosting algorithms or nearest neighbor-based approaches for swift and precise price predictions, encapsulating model uncertainties with such algorithms presents a complex challenge. We introduce ProbSAINT, a model that offers a principled approach for uncertainty quantification of its price predictions, along with accurate point predictions that are comparable to state-of-the-art boosting techniques. Furthermore, acknowledging that the business prefers pricing used cars based on the number of days the vehicle was listed for sale, we show how ProbSAINT can be used as a dynamic forecasting model for predicting price probabilities for different expected offer duration. Our experiments further indicate that ProbSAINT is especially accurate on instances where it is highly certain. This proves the applicability of its probabilistic predictions in real-world scenarios where trustworthiness is crucial.
- [783] arXiv:2403.03814 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just English (e.g. Llama2, Mistral) or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. Therefore, in this paper, we investigate the basic multilingual capabilities of state-of-the-art open LLMs beyond their intended use. For this purpose, we introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions across a typologically diverse set of 137 languages. With MultiQ, we evaluate language fidelity, i.e.\ whether models respond in the prompted language, and question answering accuracy. All LLMs we test respond faithfully and/or accurately for at least some languages beyond their intended use. Most models are more accurate when they respond faithfully. However, differences across models are large, and there is a long tail of languages where models are neither accurate nor faithful. We explore differences in tokenization as a potential explanation for our findings, identifying possible correlations that warrant further investigation.
- [784] arXiv:2403.03835 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cobweb: An Incremental and Hierarchical Model of Human-Like Category LearningComments: Accepted by CogSci-24Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Cobweb, a human-like category learning system, differs from most cognitive science models in incrementally constructing hierarchically organized tree-like structures guided by the category utility measure. Prior studies have shown that Cobweb can capture psychological effects such as basic-level, typicality, and fan effects. However, a broader evaluation of Cobweb as a model of human categorization remains lacking. The current study addresses this gap. It establishes Cobweb's alignment with classical human category learning effects. It also explores Cobweb's flexibility to exhibit both exemplar- and prototype-like learning within a single framework. These findings set the stage for further research on Cobweb as a robust model of human category learning.
- [785] arXiv:2403.03852 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Accelerating Convergence of Score-Based Diffusion Models, ProvablyComments: The first two authors contributed equallySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: Score-based diffusion models, while achieving remarkable empirical performance, often suffer from low sampling speed, due to extensive function evaluations needed during the sampling phase. Despite a flurry of recent activities towards speeding up diffusion generative modeling in practice, theoretical underpinnings for acceleration techniques remain severely limited. In this paper, we design novel training-free algorithms to accelerate popular deterministic (i.e., DDIM) and stochastic (i.e., DDPM) samplers. Our accelerated deterministic sampler converges at a rate $O(1/{T}^2)$ with $T$ the number of steps, improving upon the $O(1/T)$ rate for the DDIM sampler; and our accelerated stochastic sampler converges at a rate $O(1/T)$, outperforming the rate $O(1/\sqrt{T})$ for the DDPM sampler. The design of our algorithms leverages insights from higher-order approximation, and shares similar intuitions as popular high-order ODE solvers like the DPM-Solver-2. Our theory accommodates $\ell_2$-accurate score estimates, and does not require log-concavity or smoothness on the target distribution.
- [786] arXiv:2403.03864 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal ReasoningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate both visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles to encompass a diverse array of mathematical and algorithmic topics such as boolean logic, combinatorics, graph theory, optimization, search, etc., aiming to evaluate the gap between visual data interpretation and algorithmic problem-solving skills. The dataset is generated automatically from code authored by humans. All our puzzles have exact solutions that can be found from the algorithm without tedious human calculations. It ensures that our dataset can be scaled up arbitrarily in terms of reasoning complexity and dataset size. Our investigation reveals that large language models (LLMs) such as GPT4V and Gemini exhibit limited performance in puzzle-solving tasks. We find that their performance is near random in a multi-choice question-answering setup for a significant number of puzzles. The findings emphasize the challenges of integrating visual, language, and algorithmic knowledge for solving complex reasoning problems.
- [787] arXiv:2403.03874 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Impoverished Language Technology: The Lack of (Social) Class in NLPComments: Accepted to LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Since Labov's (1964) foundational work on the social stratification of language, linguistics has dedicated concerted efforts towards understanding the relationships between socio-demographic factors and language production and perception. Despite the large body of evidence identifying significant relationships between socio-demographic factors and language production, relatively few of these factors have been investigated in the context of NLP technology. While age and gender are well covered, Labov's initial target, socio-economic class, is largely absent. We survey the existing Natural Language Processing (NLP) literature and find that only 20 papers even mention socio-economic status. However, the majority of those papers do not engage with class beyond collecting information of annotator-demographics. Given this research lacuna, we provide a definition of class that can be operationalised by NLP researchers, and argue for including socio-economic class in future language technologies.
- [788] arXiv:2403.03879 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Redefining cystoscopy with ai: bladder cancer diagnosis using an efficient hybrid cnn-transformer modelMeryem Amaouche , Ouassim Karrakchou , Mounir Ghogho , Anouar El Ghazzaly , Mohamed Alami , Ahmed AmeurComments: 7 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Bladder cancer ranks within the top 10 most diagnosed cancers worldwide and is among the most expensive cancers to treat due to the high recurrence rates which require lifetime follow-ups. The primary tool for diagnosis is cystoscopy, which heavily relies on doctors' expertise and interpretation. Therefore, annually, numerous cases are either undiagnosed or misdiagnosed and treated as urinary infections. To address this, we suggest a deep learning approach for bladder cancer detection and segmentation which combines CNNs with a lightweight positional-encoding-free transformer and dual attention gates that fuse self and spatial attention for feature enhancement. The architecture suggested in this paper is efficient making it suitable for medical scenarios that require real time inference. Experiments have proven that this model addresses the critical need for a balance between computational efficiency and diagnostic accuracy in cystoscopic imaging as despite its small size it rivals large models in performance.
- [789] arXiv:2403.03881 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Latent Dataset Distillation with Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The efficacy of machine learning has traditionally relied on the availability of increasingly larger datasets. However, large datasets pose storage challenges and contain non-influential samples, which could be ignored during training without impacting the final accuracy of the model. In response to these limitations, the concept of distilling the information on a dataset into a condensed set of (synthetic) samples, namely a distilled dataset, emerged. One crucial aspect is the selected architecture (usually ConvNet) for linking the original and synthetic datasets. However, the final accuracy is lower if the employed model architecture differs from the model used during distillation. Another challenge is the generation of high-resolution images, e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation with Diffusion Models (LD3M) that combine diffusion in latent space with dataset distillation to tackle both challenges. LD3M incorporates a novel diffusion process tailored for dataset distillation, which improves the gradient norms for learning synthetic images. By adjusting the number of diffusion steps, LD3M also offers a straightforward way of controlling the trade-off between speed and accuracy. We evaluate our approach in several ImageNet subsets and for high-resolution images (128x128 and 256x256). As a result, LD3M consistently outperforms state-of-the-art distillation techniques by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.
- [790] arXiv:2403.03890 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic ManipulationComments: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2024). Videos and code: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle both long-horizon task planning while generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the end-effector pose and joint position trajectories, and distill the accurate but kinematics-unaware end-effector pose diffuser to the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically, we show that HDP achieves a significantly higher success rate than the state-of-the-art methods in both simulation and real-world.
- [791] arXiv:2403.03893 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From One to Many: Expanding the Scope of Toxicity Mitigation in Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. Code and data are available at this https URL .
- [792] arXiv:2403.03925 (cross-list from q-bio.NC) [ pdf , ps , html , other ]
-
Title: Consciousness qua Mortal ComputationSubjects: Neurons and Cognition (q-bio.NC) ; Artificial Intelligence (cs.AI)
Abstract: Computational functionalism posits that consciousness is a computation. Here we show, perhaps surprisingly, that it cannot be a Turing computation. Rather, computational functionalism implies that consciousness is a novel type of computation that has recently been proposed by Geoffrey Hinton, called mortal computation.
- [793] arXiv:2403.03929 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Extreme Precipitation Nowcasting using Transformer-based Generative ModelsCristian Meo , Ankush Roy , Mircea Lică , Junzhe Yin , Zeineb Bou Che , Yanbo Wang , Ruben Imhoff , Remko Uijlenhoet , Justin DauwelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents an innovative approach to extreme precipitation nowcasting by employing Transformer-based generative models, namely NowcastingGPT with Extreme Value Loss (EVL) regularization. Leveraging a comprehensive dataset from the Royal Netherlands Meteorological Institute (KNMI), our study focuses on predicting short-term precipitation with high accuracy. We introduce a novel method for computing EVL without assuming fixed extreme representations, addressing the limitations of current models in capturing extreme weather events. We present both qualitative and quantitative analyses, demonstrating the superior performance of the proposed NowcastingGPT-EVL in generating accurate precipitation forecasts, especially when dealing with extreme precipitation events. The code is available at \url{ this https URL }.
- [794] arXiv:2403.03949 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust ManipulationMarcel Torne , Anthony Simeonov , Zechu Li , April Chan , Tao Chen , Abhishek Gupta , Pulkit AgrawalComments: Project page: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Imitation learning methods need significant human supervision to learn policies robust to changes in object poses, physical disturbances, and visual distractors. Reinforcement learning, on the other hand, can explore the environment autonomously to learn robust behaviors but may require impractical amounts of unsafe real-world data collection. To learn performant, robust policies without the burden of unsafe real-world data collection or extensive human supervision, we propose RialTo, a system for robustifying real-world imitation learning policies via reinforcement learning in "digital twin" simulation environments constructed on the fly from small amounts of real-world data. To enable this real-to-sim-to-real pipeline, RialTo proposes an easy-to-use interface for quickly scanning and constructing digital twins of real-world environments. We also introduce a novel "inverse distillation" procedure for bringing real-world demonstrations into simulated environments for efficient fine-tuning, with minimal human intervention and engineering required. We evaluate RialTo across a variety of robotic manipulation problems in the real world, such as robustly stacking dishes on a rack, placing books on a shelf, and six other tasks. RialTo increases (over 67%) in policy robustness without requiring extensive human data collection. Project website and videos at this https URL
- [795] arXiv:2403.03950 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Stop Regressing: Training Value Functions via Classification for Scalable Deep RLJesse Farebrother , Jordi Orbay , Quan Vuong , Adrien Ali Taïga , Yevgen Chebotar , Ted Xiao , Alex Irpan , Sergey Levine , Pablo Samuel Castro , Aleksandra Faust , Aviral Kumar , Rishabh AgarwalSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
- [796] arXiv:2403.03962 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Identify Critical Nodes in Complex Network with Large Language ModelsSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Identifying critical nodes in networks is a classical decision-making task, and many methods struggle to strike a balance between adaptability and utility. Therefore, we propose an approach that empowers Evolutionary Algorithm (EA) with Large Language Models (LLMs), to generate a function called "score\_nodes" which can further be used to identify crucial nodes based on their assigned scores. Our model consists of three main components: Manual Initialization, Population Management, and LLMs-based Evolution. It evolves from initial populations with a set of designed node scoring functions created manually. LLMs leverage their strong contextual understanding and rich programming skills to perform crossover and mutation operations on the individuals, generating excellent new functions. These functions are then categorized, ranked, and eliminated to ensure the stable development of the populations while preserving diversity. Extensive experiments demonstrate the excellent performance of our method, showcasing its strong generalization ability compared to other state-of-the-art algorithms. It can consistently and orderly generate diverse and efficient node scoring functions. All source codes and models that can reproduce all results in this work are publicly available at this link: \url{https://anonymous.4open.science/r/LLM4CN-6520}
- [797] arXiv:2403.03993 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Personalized Negative Reservoir for Incremental Learning in Recommender SystemsSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Recommender systems have become an integral part of online platforms. Every day the volume of training data is expanding and the number of user interactions is constantly increasing. The exploration of larger and more expressive models has become a necessary pursuit to improve user experience. However, this progression carries with it an increased computational burden. In commercial settings, once a recommendation system model has been trained and deployed it typically needs to be updated frequently as new client data arrive. Cumulatively, the mounting volume of data is guaranteed to eventually make full batch retraining of the model from scratch computationally infeasible. Naively fine-tuning solely on the new data runs into the well-documented problem of catastrophic forgetting. Despite the fact that negative sampling is a crucial part of training with implicit feedback, no specialized technique exists that is tailored to the incremental learning framework. In this work, we take the first step to propose, a personalized negative reservoir strategy which is used to obtain negative samples for the standard triplet loss. This technique balances alleviation of forgetting with plasticity by encouraging the model to remember stable user preferences and selectively forget when user interests change. We derive the mathematical formulation of a negative sampler to populate and update the reservoir. We integrate our design in three SOTA and commonly used incremental recommendation models. We show that these concrete realizations of our negative reservoir framework achieve state-of-the-art results in standard benchmarks, on multiple standard top-k evaluation metrics.
- [798] arXiv:2403.04001 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Bidirectional Progressive Neural Networks with Episodic Return Progress for Emergent Task Sequencing and Robotic Skill TransferComments: 9 pages, 5 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Human brain and behavior provide a rich venue that can inspire novel control and learning methods for robotics. In an attempt to exemplify such a development by inspiring how humans acquire knowledge and transfer skills among tasks, we introduce a novel multi-task reinforcement learning framework named Episodic Return Progress with Bidirectional Progressive Neural Networks (ERP-BPNN). The proposed ERP-BPNN model (1) learns in a human-like interleaved manner by (2) autonomous task switching based on a novel intrinsic motivation signal and, in contrast to existing methods, (3) allows bidirectional skill transfer among tasks. ERP-BPNN is a general architecture applicable to several multi-task learning settings; in this paper, we present the details of its neural architecture and show its ability to enable effective learning and skill transfer among morphologically different robots in a reaching task. The developed Bidirectional Progressive Neural Network (BPNN) architecture enables bidirectional skill transfer without requiring incremental training and seamlessly integrates with online task arbitration. The task arbitration mechanism developed is based on soft Episodic Return progress (ERP), a novel intrinsic motivation (IM) signal. To evaluate our method, we use quantifiable robotics metrics such as 'expected distance to goal' and 'path straightness' in addition to the usual reward-based measure of episodic return common in reinforcement learning. With simulation experiments, we show that ERP-BPNN achieves faster cumulative convergence and improves performance in all metrics considered among morphologically different robots compared to the baselines.
- [799] arXiv:2403.04014 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: PromptCharm: Text-to-Image Generation through Multi-modal Prompting and RefinementComments: To appear in the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USASubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: The recent advancements in Generative AI have significantly advanced the field of text-to-image generation. The state-of-the-art text-to-image model, Stable Diffusion, is now capable of synthesizing high-quality images with a strong sense of aesthetics. Crafting text prompts that align with the model's interpretation and the user's intent thus becomes crucial. However, prompting remains challenging for novice users due to the complexity of the stable diffusion model and the non-trivial efforts required for iteratively editing and refining the text prompts. To address these challenges, we propose PromptCharm, a mixed-initiative system that facilitates text-to-image creation through multi-modal prompt engineering and refinement. To assist novice users in prompting, PromptCharm first automatically refines and optimizes the user's initial prompt. Furthermore, PromptCharm supports the user in exploring and selecting different image styles within a large database. To assist users in effectively refining their prompts and images, PromptCharm renders model explanations by visualizing the model's attention values. If the user notices any unsatisfactory areas in the generated images, they can further refine the images through model attention adjustment or image inpainting within the rich feedback loop of PromptCharm. To evaluate the effectiveness and usability of PromptCharm, we conducted a controlled user study with 12 participants and an exploratory user study with another 12 participants. These two studies show that participants using PromptCharm were able to create images with higher quality and better aligned with the user's expectations compared with using two variants of PromptCharm that lacked interaction or visualization support.
- [800] arXiv:2403.04015 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Knockoff-Guided Feature Selection via A Single Pre-trained Reinforced AgentSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Feature selection prepares the AI-readiness of data by eliminating redundant features. Prior research falls into two primary categories: i) Supervised Feature Selection, which identifies the optimal feature subset based on their relevance to the target variable; ii) Unsupervised Feature Selection, which reduces the feature space dimensionality by capturing the essential information within the feature set instead of using target variable. However, SFS approaches suffer from time-consuming processes and limited generalizability due to the dependence on the target variable and downstream ML tasks. UFS methods are constrained by the deducted feature space is latent and untraceable. To address these challenges, we introduce an innovative framework for feature selection, which is guided by knockoff features and optimized through reinforcement learning, to identify the optimal and effective feature subset. In detail, our method involves generating "knockoff" features that replicate the distribution and characteristics of the original features but are independent of the target variable. Each feature is then assigned a pseudo label based on its correlation with all the knockoff features, serving as a novel metric for feature evaluation. Our approach utilizes these pseudo labels to guide the feature selection process in 3 novel ways, optimized by a single reinforced agent: 1). A deep Q-network, pre-trained with the original features and their corresponding pseudo labels, is employed to improve the efficacy of the exploration process in feature selection. 2). We introduce unsupervised rewards to evaluate the feature subset quality based on the pseudo labels and the feature space reconstruction loss to reduce dependencies on the target variable. 3). A new {\epsilon}-greedy strategy is used, incorporating insights from the pseudo labels to make the feature selection process more effective.
- [801] arXiv:2403.04031 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can Large Language Models do Analytical Reasoning?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores the cutting-edge Large Language Model with analytical reasoning on sports. Our analytical reasoning embodies the tasks of letting large language models count how many points each team scores in a quarter in the NBA and NFL games. Our major discoveries are in two folds. Firstly, we find among all the models we employed, GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. Specifically, we compare three different prompting techniques and a divide-and-conquer approach, we find that the latter was the most effective. Our divide-and-conquer approach breaks down play-by-play data into smaller, more manageable segments, solves each piece individually, and then aggregates them together. Besides the divide-and-conquer approach, we also explore the Chain of Thought (CoT) strategy, which markedly improves outcomes for certain models, notably GPT-4 and Claude-2.1, with their accuracy rates increasing significantly. However, the CoT strategy has negligible or even detrimental effects on the performance of other models like GPT-3.5 and Gemini-Pro. Secondly, to our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores. This leads us to further investigate the factors that impact the complexity of analytical reasoning tasks with extensive experiments, through which we conclude that task complexity depends on the length of context, the information density, and the presence of related information. Our research provides valuable insights into the complexity of analytical reasoning tasks and potential directions for developing future large language models.
- [802] arXiv:2403.04033 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Online Learning with Unknown ConstraintsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Abstract: We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that captures the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.
- [803] arXiv:2403.04036 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unsupervised Contrastive Learning for Robust RF Device Fingerprinting Under Time-Domain ShiftComments: 6 pages, 5 figures, accepted by 2024 IEEE International Conference on Communications (ICC)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Radio Frequency (RF) device fingerprinting has been recognized as a potential technology for enabling automated wireless device identification and classification. However, it faces a key challenge due to the domain shift that could arise from variations in the channel conditions and environmental settings, potentially degrading the accuracy of RF-based device classification when testing and training data is collected in different domains. This paper introduces a novel solution that leverages contrastive learning to mitigate this domain shift problem. Contrastive learning, a state-of-the-art self-supervised learning approach from deep learning, learns a distance metric such that positive pairs are closer (i.e. more similar) in the learned metric space than negative pairs. When applied to RF fingerprinting, our model treats RF signals from the same transmission as positive pairs and those from different transmissions as negative pairs. Through experiments on wireless and wired RF datasets collected over several days, we demonstrate that our contrastive learning approach captures domain-invariant features, diminishing the effects of domain-specific variations. Our results show large and consistent improvements in accuracy (10.8\% to 27.8\%) over baseline models, thus underscoring the effectiveness of contrastive learning in improving device classification under domain shift.
- [804] arXiv:2403.04070 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Improving Adversarial Training using Vulnerability-Aware Perturbation BudgetComments: 19 pages, 2 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Adversarial Training (AT) effectively improves the robustness of Deep Neural Networks (DNNs) to adversarial attacks. Generally, AT involves training DNN models with adversarial examples obtained within a pre-defined, fixed perturbation bound. Notably, individual natural examples from which these adversarial examples are crafted exhibit varying degrees of intrinsic vulnerabilities, and as such, crafting adversarial examples with fixed perturbation radius for all instances may not sufficiently unleash the potency of AT. Motivated by this observation, we propose two simple, computationally cheap vulnerability-aware reweighting functions for assigning perturbation bounds to adversarial examples used for AT, named Margin-Weighted Perturbation Budget (MWPB) and Standard-Deviation-Weighted Perturbation Budget (SDWPB). The proposed methods assign perturbation radii to individual adversarial samples based on the vulnerability of their corresponding natural examples. Experimental results show that the proposed methods yield genuine improvements in the robustness of AT algorithms against various adversarial attacks.
- [805] arXiv:2403.04071 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: On-device Self-supervised Learning of Visual Perception Tasks aboard Hardware-limited Nano-quadrotorsComments: \c{opyright} 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Sub-\SI{50}{\gram} nano-drones are gaining momentum in both academia and industry. Their most compelling applications rely on onboard deep learning models for perception despite severe hardware constraints (\ie sub-\SI{100}{\milli\watt} processor). When deployed in unknown environments not represented in the training data, these models often underperform due to domain shift. To cope with this fundamental problem, we propose, for the first time, on-device learning aboard nano-drones, where the first part of the in-field mission is dedicated to self-supervised fine-tuning of a pre-trained convolutional neural network (CNN). Leveraging a real-world vision-based regression task, we thoroughly explore performance-cost trade-offs of the fine-tuning phase along three axes: \textit{i}) dataset size (more data increases the regression performance but requires more memory and longer computation); \textit{ii}) methodologies (\eg fine-tuning all model parameters vs. only a subset); and \textit{iii}) self-supervision strategy. Our approach demonstrates an improvement in mean absolute error up to 30\% compared to the pre-trained baseline, requiring only \SI{22}{\second} fine-tuning on an ultra-low-power GWT GAP9 System-on-Chip. Addressing the domain shift problem via on-device learning aboard nano-drones not only marks a novel result for hardware-limited robots but lays the ground for more general advancements for the entire robotics community.
- [806] arXiv:2403.04073 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Semi-Supervised Dialogue Abstractive Summarization via High-Quality Pseudolabel SelectionComments: 21 pages, 10 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Semi-supervised dialogue summarization (SSDS) leverages model-generated summaries to reduce reliance on human-labeled data and improve the performance of summarization models. While addressing label noise, previous works on semi-supervised learning primarily focus on natural language understanding tasks, assuming each sample has a unique label. However, these methods are not directly applicable to SSDS, as it is a generative task, and each dialogue can be summarized in different ways. In this work, we propose a novel scoring approach, SiCF, which encapsulates three primary dimensions of summarization model quality: Semantic invariance (indicative of model confidence), Coverage (factual recall), and Faithfulness (factual precision). Using the SiCF score, we select unlabeled dialogues with high-quality generated summaries to train summarization models. Comprehensive experiments on three public datasets demonstrate the effectiveness of SiCF scores in uncertainty estimation and semi-supervised learning for dialogue summarization tasks. Our code is available at \url{ this https URL }.
- [807] arXiv:2403.04115 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: DNAct: Diffusion Guided Multi-Task 3D Policy LearningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion to a 3D space, which provides a comprehensive semantic understanding regarding the scene. Consequently, it allows various applications to challenging robotic tasks requiring rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach utilizing diffusion training to learn a vision and language feature that encapsulates the inherent multi-modality in the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model is capable of distinguishing different modalities and thus improving the robustness and the generalizability of the learned representation. DNAct significantly surpasses SOTA NeRF-based multi-task manipulation approaches with over 30% improvement in success rate. Project website: this http URL .
- [808] arXiv:2403.04146 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated LearningJournal-ref: Data Science and Engineering (2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated learning (FL) is a promising approach for learning a model from data distributed on massive clients without exposing data privacy. It works effectively in the ideal federation where clients share homogeneous data distribution and learning behavior. However, FL may fail to function appropriately when the federation is not ideal, amid an unhealthy state called Negative Federated Learning (NFL), in which most clients gain no benefit from participating in FL. Many studies have tried to address NFL. However, their solutions either (1) predetermine to prevent NFL in the entire learning life-cycle or (2) tackle NFL in the aftermath of numerous learning rounds. Thus, they either (1) indiscriminately incur extra costs even if FL can perform well without such costs or (2) waste numerous learning rounds. Additionally, none of the previous work takes into account the clients who may be unwilling/unable to follow the proposed NFL solutions when using those solutions to upgrade an FL system in use. This paper introduces FL-GUARD, a holistic framework that can be employed on any FL system for tackling NFL in a run-time paradigm. That is, to dynamically detect NFL at the early stage (tens of rounds) of learning and then to activate recovery measures when necessary. Specifically, we devise a cost-effective NFL detection mechanism, which relies on an estimation of performance gain on clients. Only when NFL is detected, we activate the NFL recovery process, in which each client learns in parallel an adapted model when training the global model. Extensive experiment results confirm the effectiveness of FL-GUARD in detecting NFL and recovering from NFL to a healthy learning state. We also show that FL-GUARD is compatible with previous NFL solutions and robust against clients unwilling/unable to take any recovery measures.
- [809] arXiv:2403.04158 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DA-Net: A Disentangled and Adaptive Network for Multi-Source Cross-Lingual Transfer LearningComments: AAAI 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Multi-Source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labelled source languages to an unlabeled target language under the language shift. Existing methods typically focus on weighting the predictions produced by language-specific classifiers of different sources that follow a shared encoder. However, all source languages share the same encoder, which is updated by all these languages. The extracted representations inevitably contain different source languages' information, which may disturb the learning of the language-specific classifiers. Additionally, due to the language gap, language-specific classifiers trained with source labels are unable to make accurate predictions for the target language. Both facts impair the model's performance. To address these challenges, we propose a Disentangled and Adaptive Network (DA-Net). Firstly, we devise a feedback-guided collaborative disentanglement method that seeks to purify input representations of classifiers, thereby mitigating mutual interference from multiple sources. Secondly, we propose a class-aware parallel adaptation method that aligns class-level distributions for each source-target language pair, thereby alleviating the language pairs' language gap. Experimental results on three different tasks involving 38 languages validate the effectiveness of our approach.
- [810] arXiv:2403.04160 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Improving Retrieval in Theme-specific Applications using a Corpus Topical TaxonomyComments: TheWebConf'24Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Document retrieval has greatly benefited from the advancements of large-scale pre-trained language models (PLMs). However, their effectiveness is often limited in theme-specific applications for specialized areas or industries, due to unique terminologies, incomplete contexts of user queries, and specialized search intents. To capture the theme-specific information and improve retrieval, we propose to use a corpus topical taxonomy, which outlines the latent topic structure of the corpus while reflecting user-interested aspects. We introduce ToTER (Topical Taxonomy Enhanced Retrieval) framework, which identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts. As a plug-and-play framework, ToTER can be flexibly employed to enhance various PLM-based retrievers. Through extensive quantitative, ablative, and exploratory experiments on two real-world datasets, we ascertain the benefits of using topical taxonomy for retrieval in theme-specific applications and demonstrate the effectiveness of ToTER.
- [811] arXiv:2403.04164 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ProMISe: Promptable Medical Image Segmentation using SAMSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: With the proposal of the Segment Anything Model (SAM), fine-tuning SAM for medical image segmentation (MIS) has become popular. However, due to the large size of the SAM model and the significant domain gap between natural and medical images, fine-tuning-based strategies are costly with potential risk of instability, feature damage and catastrophic forgetting. Furthermore, some methods of transferring SAM to a domain-specific MIS through fine-tuning strategies disable the model's prompting capability, severely limiting its utilization scenarios. In this paper, we propose an Auto-Prompting Module (APM), which provides SAM-based foundation model with Euclidean adaptive prompts in the target domain. Our experiments demonstrate that such adaptive prompts significantly improve SAM's non-fine-tuned performance in MIS. In addition, we propose a novel non-invasive method called Incremental Pattern Shifting (IPS) to adapt SAM to specific medical domains. Experimental results show that the IPS enables SAM to achieve state-of-the-art or competitive performance in MIS without the need for fine-tuning. By coupling these two methods, we propose ProMISe, an end-to-end non-fine-tuned framework for Promptable Medical Image Segmentation. Our experiments demonstrate that both using our methods individually or in combination achieves satisfactory performance in low-cost pattern shifting, with all of SAM's parameters frozen.
- [812] arXiv:2403.04175 (cross-list from physics.med-ph) [ pdf , ps , other ]
-
Title: Understanding the PULSAR Effect in Combined Radiotherapy and Immunotherapy through Attention Mechanisms with a Transformer ModelSubjects: Medical Physics (physics.med-ph) ; Artificial Intelligence (cs.AI)
Abstract: PULSAR (personalized, ultra-fractionated stereotactic adaptive radiotherapy) is the adaptation of stereotactic ablative radiotherapy towards personalized cancer management. For the first time, we applied a transformer-based attention mechanism to investigate the underlying interactions between combined PULSAR and PD-L1 blockade immunotherapy based on a murine cancer model (Lewis Lung Carcinoma, LLC). The proposed approach is able to predict the trend of tumor volume change semi-quantitatively, and excels in identifying the potential causal relationships through both self-attention and cross-attention scores.
- [813] arXiv:2403.04182 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Metric-aware LLM inference for regression and scoringComments: 15 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have demonstrated strong results on a range of NLP tasks. Typically, outputs are obtained via autoregressive sampling from the LLM's underlying distribution. Building on prior work on Minimum Bayes Risk Decoding, we show that this inference strategy can be suboptimal for a range of regression and scoring tasks, and associated evaluation metrics. As a remedy, we propose metric aware LLM inference: a decision theoretic approach optimizing for custom regression and scoring metrics at inference time. We report improvements over baselines on academic benchmarks and publicly available models.
- [814] arXiv:2403.04187 (cross-list from physics.bio-ph) [ pdf , ps , html , other ]
-
Title: Preference optimization of protein language models as a multi-objective binder design paradigmComments: Published at the GEM workshop, ICLR 2024. Generative and Experimental Perspectives for Biomolecular Design ( this https URL )Subjects: Biological Physics (physics.bio-ph) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
Abstract: We present a multi-objective binder design paradigm based on instruction fine-tuning and direct preference optimization (DPO) of autoregressive protein language models (pLMs). Multiple design objectives are encoded in the language model through direct optimization on expert curated preference sequence datasets comprising preferred and dispreferred distributions. We show the proposed alignment strategy enables ProtGPT2 to effectively design binders conditioned on specified receptors and a drug developability criterion. Generated binder samples demonstrate median isoelectric point (pI) improvements by $17\%-60\%$.
- [815] arXiv:2403.04190 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative AI for Synthetic Data Generation: Methods, Challenges and the FutureSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The recent surge in research focused on generating synthetic data from large language models (LLMs), especially for scenarios with limited data availability, marks a notable shift in Generative Artificial Intelligence (AI). Their ability to perform comparably to real-world data positions this approach as a compelling solution to low-resource challenges. This paper delves into advanced technologies that leverage these gigantic LLMs for the generation of task-specific training data. We outline methodologies, evaluation techniques, and practical applications, discuss the current limitations, and suggest potential pathways for future research.
- [816] arXiv:2403.04197 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large Language Models are In-Context Molecule LearnersSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods in adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve the challenges, we propose In-Context Molecule Adaptation (ICMA), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve informative context examples. Additionally, we also propose Post-retrieval Re-ranking with Sequence Reversal and Random Walk to further improve the quality of retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context molecule learning capability of LLMs with retrieved examples and adapts the parameters of LLMs for the molecule-caption translation task. Experimental results demonstrate that ICMT can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures, showing that LLMs are inherently in-context molecule learners.
- [817] arXiv:2403.04202 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Dynamics of Moral Behavior in Heterogeneous Populations of Learning AgentsSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Growing concerns about safety and alignment of AI systems highlight the importance of embedding moral capabilities in artificial agents. A promising solution is the use of learning from experience, i.e., Reinforcement Learning. In multi-agent (social) environments, complex population-level phenomena may emerge from interactions between individual learning agents. Many of the existing studies rely on simulated social dilemma environments to study the interactions of independent learning agents. However, they tend to ignore the moral heterogeneity that is likely to be present in societies of agents in practice. For example, at different points in time a single learning agent may face opponents who are consequentialist (i.e., caring about maximizing some outcome over time) or norm-based (i.e., focusing on conforming to a specific norm here and now). The extent to which agents' co-development may be impacted by such moral heterogeneity in populations is not well understood. In this paper, we present a study of the learning dynamics of morally heterogeneous populations interacting in a social dilemma setting. Using a Prisoner's Dilemma environment with a partner selection mechanism, we investigate the extent to which the prevalence of diverse moral agents in populations affects individual agents' learning behaviors and emergent population-level outcomes. We observe several types of non-trivial interactions between pro-social and anti-social agents, and find that certain classes of moral agents are able to steer selfish agents towards more cooperative behavior.
- [818] arXiv:2403.04221 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Why Online Reinforcement Learning is CausalComments: 27 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Reinforcement learning (RL) and causal modelling naturally complement each other. The goal of causal modelling is to predict the effects of interventions in an environment, while the goal of reinforcement learning is to select interventions that maximize the rewards the agent receives from the environment. Reinforcement learning includes the two most powerful sources of information for estimating causal relationships: temporal ordering and the ability to act on an environment. This paper examines which reinforcement learning settings we can expect to benefit from causal modelling, and how. In online learning, the agent has the ability to interact directly with their environment, and learn from exploring it. Our main argument is that in online learning, conditional probabilities are causal, and therefore offline RL is the setting where causal learning has the most potential to make a difference. Essentially, the reason is that when an agent learns from their {\em own} experience, there are no unobserved confounders that influence both the agent's own exploratory actions and the rewards they receive. Our paper formalizes this argument. For offline RL, where an agent may and typically does learn from the experience of {\em others}, we describe previous and new methods for leveraging a causal model, including support for counterfactual queries.
- [819] arXiv:2403.04224 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Aligners: Decoupling LLMs and AlignmentComments: Tiny Papers Track at the International Conference on Learning Representations (ICLR) 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criteria on an as-needed basis, thus also reducing the potential negative impacts of alignment on performance. Our recipe for training the aligner models solely relies on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We illustrate our method by training an "ethical" aligner and verify its efficacy empirically.
- [820] arXiv:2403.04232 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Generalizing Cooperative Eco-driving via Multi-residual Task LearningComments: Accepted for publication at ICRA 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract: Conventional control, such as model-based control, is commonly utilized in autonomous driving due to its efficiency and reliability. However, real-world autonomous driving contends with a multitude of diverse traffic scenarios that are challenging for these planning algorithms. Model-free Deep Reinforcement Learning (DRL) presents a promising avenue in this direction, but learning DRL control policies that generalize to multiple traffic scenarios is still a challenge. To address this, we introduce Multi-residual Task Learning (MRTL), a generic learning framework based on multi-task learning that, for a set of task scenarios, decomposes the control into nominal components that are effectively solved by conventional control methods and residual terms which are solved using learning. We employ MRTL for fleet-level emission reduction in mixed traffic using autonomous vehicles as a means of system control. By analyzing the performance of MRTL across nearly 600 signalized intersections and 1200 traffic scenarios, we demonstrate that it emerges as a promising approach to synergize the strengths of DRL and conventional methods in generalizable control.
- [821] arXiv:2403.04233 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DEEP-ICL: Definition-Enriched Experts for Language Model In-Context LearningXingwei Qu , Yiming Liang , Yucheng Wang , Tianyu Zheng , Tommy Yue , Lei Ma , Stephen W. Huang , Jiajun Zhang , Wenhu Chen , Chenghua Lin , Jie Fu , Ge ZhangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: It has long been assumed that the sheer number of parameters in large language models (LLMs) drives in-context learning (ICL) capabilities, enabling remarkable performance improvements by leveraging task-specific demonstrations. Challenging this hypothesis, we introduce DEEP-ICL, a novel task Definition Enriched ExPert Ensembling methodology for ICL. DEEP-ICL explicitly extracts task definitions from given demonstrations and generates responses through learning task-specific examples. We argue that improvement from ICL does not directly rely on model size, but essentially stems from understanding task definitions and task-guided learning. Inspired by this, DEEP-ICL combines two 3B models with distinct roles (one for concluding task definitions and the other for learning task demonstrations) and achieves comparable performance to LLaMA2-13B. Furthermore, our framework outperforms conventional ICL by overcoming pretraining sequence length limitations, by supporting unlimited demonstrations. We contend that DEEP-ICL presents a novel alternative for achieving efficient few-shot learning, extending beyond the conventional ICL.
- [822] arXiv:2403.04246 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Efficient CNN-LSTM based Parameter Estimation of Levy Driven Stochastic Differential EquationsComments: 2023 International Conference on Machine Learning and Applications (ICMLA)Subjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study addresses the challenges in parameter estimation of stochastic differential equations driven by non-Gaussian noises, which are critical in understanding dynamic phenomena such as price fluctuations and the spread of infectious diseases. Previous research highlighted the potential of LSTM networks in estimating parameters of alpha stable Levy driven SDEs but faced limitations including high time complexity and constraints of the LSTM chaining property. To mitigate these issues, we introduce the PEnet, a novel CNN-LSTM-based three-stage model that offers an end to end approach with superior accuracy and adaptability to varying data structures, enhanced inference speed for long sequence observations through initial data feature condensation by CNN, and high generalization capability, allowing its application to various complex SDE scenarios. Experiments on synthetic datasets confirm PEnet significant advantage in estimating SDE parameters associated with noise characteristics, establishing it as a competitive method for SDE parameter estimation in the presence of Levy noise.
- [823] arXiv:2403.04256 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Federated Recommendation via Hybrid Retrieval Augmented GenerationSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Federated Recommendation (FR) emerges as a novel paradigm that enables privacy-preserving recommendations. However, traditional FR systems usually represent users/items with discrete identities (IDs), suffering from performance degradation due to the data sparsity and heterogeneity in FR. On the other hand, Large Language Models (LLMs) as recommenders have proven effective across various recommendation scenarios. Yet, LLM-based recommenders encounter challenges such as low inference efficiency and potential hallucination, compromising their performance in real-world scenarios. To this end, we propose GPT-FedRec, a federated recommendation framework leveraging ChatGPT and a novel hybrid Retrieval Augmented Generation (RAG) mechanism. GPT-FedRec is a two-stage solution. The first stage is a hybrid retrieval process, mining ID-based user patterns and text-based item features. Next, the retrieved results are converted into text prompts and fed into GPT for re-ranking. Our proposed hybrid retrieval mechanism and LLM-based re-rank aims to extract generalized features from data and exploit pretrained knowledge within LLM, overcoming data sparsity and heterogeneity in FR. In addition, the RAG approach also prevents LLM hallucination, improving the recommendation performance for real-world users. Experimental results on diverse benchmark datasets demonstrate the superior performance of GPT-FedRec against state-of-the-art baseline methods.
- [824] arXiv:2403.04283 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with ProxyYu Zhu , Chuxiong Sun , Wenfei Yang , Wenqiang Wei , Bo Tang , Tianzhu Zhang , Zhiyu Li , Shifeng Zhang , Feiyu Xiong , Jie Hu , Mingchuan yangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure Large Language Models (LLMs) align with human values. However, existing RLHF methods require a high computational cost, one main reason being that RLHF assigns both the generation and alignment tasks to the LLM simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs, achieving alignment with human values at a much lower computational cost. We start with a novel Markov Decision Process (MDP) designed for the alignment process and employ Reinforcement Learning (RL) to train a streamlined proxy model that oversees the token generation of the LLM, without altering the LLM itself. Experiments show that our method achieves a comparable level of alignment with only 1\% of the training parameters of other methods.
- [825] arXiv:2403.04299 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: LitSim: A Conflict-aware Policy for Long-term Interactive Traffic SimulationHaojie Xin , Xiaodong Zhang , Renzhi Tang , Songyang Yan , Qianrui Zhao , Chunze Yang , Wen Cui , Zijiang YangComments: 9 pages, 6 figures, under reviewSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Simulation is pivotal in evaluating the performance of autonomous driving systems due to the advantages of high efficiency and low cost compared to on-road testing. Bridging the gap between simulation and the real world requires realistic agent behaviors. However, the existing works have the following shortcomings in achieving this goal: (1) log replay offers realistic scenarios but often leads to collisions due to the absence of dynamic interactions, and (2) both heuristic-based and data-based solutions, which are parameterized and trained on real-world datasets, encourage interactions but often deviate from real-world data over long horizons. In this work, we propose LitSim, a long-term interactive simulation approach that maximizes realism by minimizing the interventions in the log. Specifically, our approach primarily uses log replay to ensure realism and intervenes only when necessary to prevent potential conflicts. We then encourage interactions among the agents and resolve the conflicts, thereby reducing the risk of unrealistic behaviors. We train and validate our model on the real-world dataset NGSIM, and the experimental results demonstrate that LitSim outperforms the currently popular approaches in terms of realism and reactivity.
- [826] arXiv:2403.04306 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Effectiveness Assessment of Recent Large Vision-Language ModelsYao Jiang , Xinyu Yan , Ge-Peng Ji , Keren Fu , Meijun Sun , Huan Xiong , Deng-Ping Fan , Fahad Shahbaz KhanSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The advent of large vision-language models (LVLMs) represents a noteworthy advancement towards the pursuit of artificial general intelligence. However, the model efficacy across both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their efficacy in specialized tasks, we employ six challenging tasks across three distinct application scenarios, namely natural, healthcare, and industrial ones. Such six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization under these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope this study could provide useful insights for the future development of LVLMs, helping researchers improve LVLMs to cope with both general and specialized applications.
- [827] arXiv:2403.04309 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AO-DETR: Anti-Overlapping DETR for X-Ray Prohibited Items DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Prohibited item detection in X-ray images is one of the most essential and highly effective methods widely employed in various security inspection scenarios. Considering the significant overlapping phenomenon in X-ray prohibited item images, we propose an Anti-Overlapping DETR (AO-DETR) based on one of the state-of-the-art general object detectors, DINO. Specifically, to address the feature coupling issue caused by overlapping phenomena, we introduce the Category-Specific One-to-One Assignment (CSA) strategy to constrain category-specific object queries in predicting prohibited items of fixed categories, which can enhance their ability to extract features specific to prohibited items of a particular category from the overlapping foreground-background features. To address the edge blurring problem caused by overlapping phenomena, we propose the Look Forward Densely (LFD) scheme, which improves the localization accuracy of reference boxes in mid-to-high-level decoder layers and enhances the ability to locate blurry edges of the final layer. Similar to DINO, our AO-DETR provides two different versions with distinct backbones, tailored to meet diverse application requirements. Extensive experiments on the PIXray and OPIXray datasets demonstrate that the proposed method surpasses the state-of-the-art object detectors, indicating its potential applications in the field of prohibited item detection. The source code will be released at this https URL .
- [828] arXiv:2403.04321 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Discriminative Probing and Tuning for Text-to-Image GenerationComments: CVPR 2024; project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract: Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models.
- [829] arXiv:2403.04325 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Measuring Meaning Composition in the Human Brain with Composition Scores from Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The process of meaning composition, wherein smaller units like morphemes or words combine to form the meaning of phrases and sentences, is essential for human sentence comprehension. Despite extensive neurolinguistic research into the brain regions involved in meaning composition, a computational metric to quantify the extent of composition is still lacking. Drawing on the key-value memory interpretation of transformer feed-forward network blocks, we introduce the Composition Score, a novel model-based metric designed to quantify the degree of meaning composition during sentence comprehension. Experimental findings show that this metric correlates with brain clusters associated with word frequency, structural processing, and general sensitivity to words, suggesting the multifaceted nature of meaning composition during human sentence comprehension.
- [830] arXiv:2403.04326 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Edge-based Parametric Digital Twins for Intelligent Building Indoor Climate ModelingZhongjun Ni (1), Chi Zhang (2), Magnus Karlsson (1), Shaofang Gong (1) ((1) Department of Science and Technology, Linköping University, Campus Norrköping, Norrköping, Sweden. (2) Department of Computer Science and Engineering, University of Gothenburg, Gothenburg, Sweden.)Comments: 8 pages, 8 figures, accepted in the 20th IEEE International Conference on Factory Communication SystemsSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Digital transformation in the built environment generates vast data for developing data-driven models to optimize building operations. This study presents an integrated solution utilizing edge computing, digital twins, and deep learning to enhance the understanding of climate in buildings. Parametric digital twins, created using an ontology, ensure consistent data representation across diverse service systems equipped by different buildings. Based on created digital twins and collected data, deep learning methods are employed to develop predictive models for identifying patterns in indoor climate and providing insights. Both the parametric digital twin and deep learning models are deployed on edge for low latency and privacy compliance. As a demonstration, a case study was conducted in a historic building in Östergötland, Sweden, to compare the performance of five deep learning architectures. The results indicate that the time-series dense encoder model exhibited strong competitiveness in performing multi-horizon forecasts of indoor temperature and relative humidity with low computational costs.
- [831] arXiv:2403.04359 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Symmetry Considerations for Learning Task Symmetric Robot PoliciesComments: M. Mittal and N. Rudin contributed equally. Accepted for ICRA 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Symmetry is a fundamental aspect of many real-world robotic tasks. However, current deep reinforcement learning (DRL) approaches can seldom harness and exploit symmetry effectively. Often, the learned behaviors fail to achieve the desired transformation invariances and suffer from motion artifacts. For instance, a quadruped may exhibit different gaits when commanded to move forward or backward, even though it is symmetrical about its torso. This issue becomes further pronounced in high-dimensional or complex environments, where DRL methods are prone to local optima and fail to explore regions of the state space equally. Past methods on encouraging symmetry for robotic tasks have studied this topic mainly in a single-task setting, where symmetry usually refers to symmetry in the motion, such as the gait patterns. In this paper, we revisit this topic for goal-conditioned tasks in robotics, where symmetry lies mainly in task execution and not necessarily in the learned motions themselves. In particular, we investigate two approaches to incorporate symmetry invariance into DRL -- data augmentation and mirror loss function. We provide a theoretical foundation for using augmented samples in an on-policy setting. Based on this, we show that the corresponding approach achieves faster convergence and improves the learned behaviors in various challenging robotic tasks, from climbing boxes with a quadruped to dexterous manipulation.
- [832] arXiv:2403.04374 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Model-Free Load Frequency Control of Nonlinear Power Systems Based on Deep Reinforcement LearningSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI)
Abstract: Load frequency control (LFC) is widely employed in power systems to stabilize frequency fluctuation and guarantee power quality. However, most existing LFC methods rely on accurate power system modeling and usually ignore the nonlinear characteristics of the system, limiting controllers' performance. To solve these problems, this paper proposes a model-free LFC method for nonlinear power systems based on deep deterministic policy gradient (DDPG) framework. The proposed method establishes an emulator network to emulate power system dynamics. After defining the action-value function, the emulator network is applied for control actions evaluation instead of the critic network. Then the actor network controller is effectively optimized by estimating the policy gradient based on zeroth-order optimization (ZOO) and backpropagation algorithm. Simulation results and corresponding comparisons demonstrate the designed controller can generate appropriate control actions and has strong adaptability for nonlinear power systems.
- [833] arXiv:2403.04382 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Acceleron: A Tool to Accelerate Research IdeationComments: Accepted at AI2ASE Workshop at AAAI'24 Conference. 13 Pages and 4 FiguresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Several tools have recently been proposed for assisting researchers during various stages of the research life-cycle. However, these primarily concentrate on tasks such as retrieving and recommending relevant literature, reviewing and critiquing the draft, and writing of research manuscripts. Our investigation reveals a significant gap in availability of tools specifically designed to assist researchers during the challenging ideation phase of the research life-cycle. To aid with research ideation, we propose `Acceleron', a research accelerator for different phases of the research life cycle, and which is specially designed to aid the ideation process. Acceleron guides researchers through the formulation of a comprehensive research proposal, encompassing a novel research problem. The proposals motivation is validated for novelty by identifying gaps in the existing literature and suggesting a plausible list of techniques to solve the proposed problem. We leverage the reasoning and domain-specific skills of Large Language Models (LLMs) to create an agent-based architecture incorporating colleague and mentor personas for LLMs. The LLM agents emulate the ideation process undertaken by researchers, engaging researchers in an interactive fashion to aid in the development of the research proposal. Notably, our tool addresses challenges inherent in LLMs, such as hallucinations, implements a two-stage aspect-based retrieval to manage precision-recall trade-offs, and tackles issues of unanswerability. As evaluation, we illustrate the execution of our motivation validation and method synthesis workflows on proposals from the ML and NLP domain, given by 3 distinct researchers. Our observations and evaluations provided by the researchers illustrate the efficacy of the tool in terms of assisting researchers with appropriate inputs at distinct stages and thus leading to improved time efficiency.
- [834] arXiv:2403.04417 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Promising and worth-to-try future directions for advancing state-of-the-art surrogates methods of agent-based models in social and health computational sciencesComments: 4 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Dynamical Systems (math.DS)
Abstract: The execution and runtime performance of model-based analysis tools for realistic large-scale ABMs (Agent-Based Models) can be excessively long. This due to the computational demand exponentially proportional to the model size (e.g. Population size) and the number of model parameters. Even the runtime of a single simulation of a realistic ABM may demand huge computational resources when attempting to employ realistic population size. The main aim of this ad-hoc brief report is to highlight some of surrogate models that were adequate and computationally less demanding for nonlinear dynamical models in various modeling application this http URL the author knowledge, these methods have been not, at least extensively, employed for ABMs within the field of (SHCS) Social Health Computational Sciences, yet. Thus, they might be, but not necessarily, useful in progressing state of the art for establishing surrogate models for ABMs in the field of SHCS.
- [835] arXiv:2403.04427 (cross-list from cs.CE) [ pdf , ps , html , other ]
-
Title: Sentiment-driven prediction of financial returns: a Bayesian-enhanced FinBERT approachComments: Version exposed at XXV Workshop on Quantitative Finance Bologna (Italy), April 11-13 2024 (not peer reviewed but accepted for the workshop)Subjects: Computational Engineering, Finance, and Science (cs.CE) ; Artificial Intelligence (cs.AI)
Abstract: Predicting financial returns accurately poses a significant challenge due to the inherent uncertainty in financial time series data. Enhancing prediction models' performance hinges on effectively capturing both social and financial sentiment. In this study, we showcase the efficacy of leveraging sentiment information extracted from tweets using the FinBERT large language model. By meticulously curating an optimal feature set through correlation analysis and employing Bayesian-optimized Recursive Feature Elimination for automatic feature selection, we surpass existing methodologies, achieving an F1-score exceeding 70% on the test set. This success translates into demonstrably higher cumulative profits during backtested trading. Our investigation focuses on real-world SPY ETF data alongside corresponding tweets sourced from the StockTwits platform.
- [836] arXiv:2403.04436 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Learning Human-to-Humanoid Real-Time Whole-Body TeleoperationComments: Project website: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: We present Human to Humanoid (H2O), a reinforcement learning (RL) based framework that enables real-time whole-body teleoperation of a full-sized humanoid robot with only an RGB camera. To create a large-scale retargeted motion dataset of human movements for humanoid robots, we propose a scalable "sim-to-data" process to filter and pick feasible motions using a privileged motion imitator. Afterwards, we train a robust real-time humanoid motion imitator in simulation using these refined motions and transfer it to the real humanoid robot in a zero-shot manner. We successfully achieve teleoperation of dynamic whole-body motions in real-world scenarios, including walking, back jumping, kicking, turning, waving, pushing, boxing, etc. To the best of our knowledge, this is the first demonstration to achieve learning-based real-time whole-body humanoid teleoperation.
- [837] arXiv:2403.04442 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cooperative Bayesian Optimization for Imperfect AgentsJournal-ref: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: We introduce a cooperative Bayesian optimization problem for optimizing black-box functions of two variables where two agents choose together at which points to query the function but have only control over one variable each. This setting is inspired by human-AI teamwork, where an AI-assistant helps its human user solve a problem, in this simplest case, collaborative optimization. We formulate the solution as sequential decision-making, where the agent we control models the user as a computationally rational agent with prior knowledge about the function. We show that strategic planning of the queries enables better identification of the global maximum of the function as long as the user avoids excessive exploration. This planning is made possible by using Bayes Adaptive Monte Carlo planning and by endowing the agent with a user model that accounts for conservative belief updates and exploratory sampling of the points to query.
- [838] arXiv:2403.04447 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FRRI: a novel algorithm for fuzzy-rough rule inductionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Interpretability is the next frontier in machine learning research. In the search for white box models - as opposed to black box models, like random forests or neural networks - rule induction algorithms are a logical and promising option, since the rules can easily be understood by humans. Fuzzy and rough set theory have been successfully applied to this archetype, almost always separately. As both approaches to rule induction involve granular computing based on the concept of equivalence classes, it is natural to combine them. The QuickRules\cite{JensenCornelis2009} algorithm was a first attempt at using fuzzy rough set theory for rule induction. It is based on QuickReduct, a greedy algorithm for building decision reducts. QuickRules already showed an improvement over other rule induction methods. However, to evaluate the full potential of a fuzzy rough rule induction algorithm, one needs to start from the foundations. In this paper, we introduce a novel rule induction algorithm called Fuzzy Rough Rule Induction (FRRI). We provide background and explain the workings of our algorithm. Furthermore, we perform a computational experiment to evaluate the performance of our algorithm and compare it to other state-of-the-art rule induction approaches. We find that our algorithm is more accurate while creating small rulesets consisting of relatively short rules. We end the paper by outlining some directions for future work.
- [839] arXiv:2403.04454 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Low-Resource Court Judgment Summarization for Common Law SystemsComments: First submitted to Information Processing and Management on Oct. 29, 2023. Major Revision submitted on Mar.6, 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Common law courts need to refer to similar precedents' judgments to inform their current decisions. Generating high-quality summaries of court judgment documents can facilitate legal practitioners to efficiently review previous cases and assist the general public in accessing how the courts operate and how the law is applied. Previous court judgment summarization research focuses on civil law or a particular jurisdiction's judgments. However, judges can refer to the judgments from all common law jurisdictions. Current summarization datasets are insufficient to satisfy the demands of summarizing precedents across multiple jurisdictions, especially when labeled data are scarce for many jurisdictions. To address the lack of datasets, we present CLSum, the first dataset for summarizing multi-jurisdictional common law court judgment documents. Besides, this is the first court judgment summarization work adopting large language models (LLMs) in data augmentation, summary generation, and evaluation. Specifically, we design an LLM-based data augmentation method incorporating legal knowledge. We also propose a legal knowledge enhanced evaluation metric based on LLM to assess the quality of generated judgment summaries. Our experimental results verify that the LLM-based summarization methods can perform well in the few-shot and zero-shot settings. Our LLM-based data augmentation method can mitigate the impact of low data resources. Furthermore, we carry out comprehensive comparative experiments to find essential model components and settings that are capable of enhancing summarization performance.
- [840] arXiv:2403.04468 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Survey of Graph Neural Networks in Real world: Imbalance, Noise, Privacy and OOD ChallengesWei Ju , Siyu Yi , Yifan Wang , Zhiping Xiao , Zhengyang Mao , Hourun Li , Yiyang Gu , Yifang Qin , Nan Yin , Senzhang Wang , Xinwang Liu , Xiao Luo , Philip S. Yu , Ming ZhangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Abstract: Graph-structured data exhibits universality and widespread applicability across diverse domains, such as social network analysis, biochemistry, financial fraud detection, and network security. Significant strides have been made in leveraging Graph Neural Networks (GNNs) to achieve remarkable success in these areas. However, in real-world scenarios, the training environment for models is often far from ideal, leading to substantial performance degradation of GNN models due to various unfavorable factors, including imbalance in data distribution, the presence of noise in erroneous data, privacy protection of sensitive information, and generalization capability for out-of-distribution (OOD) scenarios. To tackle these issues, substantial efforts have been devoted to improving the performance of GNN models in practical real-world scenarios, as well as enhancing their reliability and robustness. In this paper, we present a comprehensive survey that systematically reviews existing GNN models, focusing on solutions to the four mentioned real-world challenges including imbalance, noise, privacy, and OOD in practical scenarios that many existing reviews have not considered. Specifically, we first highlight the four key challenges faced by existing GNNs, paving the way for our exploration of real-world GNN models. Subsequently, we provide detailed discussions on these four aspects, dissecting how these solutions contribute to enhancing the reliability and robustness of GNN models. Last but not least, we outline promising directions and offer future perspectives in the field.
- [841] arXiv:2403.04473 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TextMonkey: An OCR-Free Large Multimodal Model for Understanding DocumentSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancement across several dimensions: By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; We hypothesize that images may contain redundant tokens, and by using similarity to filter out significant tokens, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability. It also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It outperforms in scene text spotting with a 10.9\% increase and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding. Code will be released at this https URL .
- [842] arXiv:2403.04481 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Do Large Language Model Understand Multi-Intent Spoken Language ?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This research signifies a considerable breakthrough in leveraging Large Language Models (LLMs) for multi-intent spoken language understanding (SLU). Our approach re-imagines the use of entity slots in multi-intent SLU applications, making the most of the generative potential of LLMs within the SLU landscape, leading to the development of the EN-LLM series. Furthermore, we introduce the concept of Sub-Intent Instruction (SII) to amplify the analysis and interpretation of complex, multi-intent communications, which further supports the creation of the ENSI-LLM models series. Our novel datasets, identified as LM-MixATIS and LM-MixSNIPS, are synthesized from existing benchmarks. The study evidences that LLMs may match or even surpass the performance of the current best multi-intent SLU models. We also scrutinize the performance of LLMs across a spectrum of intent configurations and dataset distributions. On top of this, we present two revolutionary metrics - Entity Slot Accuracy (ESA) and Combined Semantic Accuracy (CSA) - to facilitate a detailed assessment of LLM competence in this multifaceted field." Our code and datasets are available at \url{ this https URL }.
- [843] arXiv:2403.04500 (cross-list from physics.med-ph) [ pdf , ps , html , other ]
-
Title: A Learnable Prior Improves Inverse Tumor Growth ModelingJonas Weidner , Ivan Ezhov , Michal Balcerak , Marie-Christin Metz , Sergey Litvinov , Sebastian Kaltenbach , Leonhard Feiner , Laurin Lux , Florian Kofler , Jana Lipkova , Jonas Latz , Daniel Rueckert , Bjoern Menze , Benedikt WiestlerComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Medical Physics (physics.med-ph) ; Artificial Intelligence (cs.AI)
Abstract: Biophysical modeling, particularly involving partial differential equations (PDEs), offers significant potential for tailoring disease treatment protocols to individual patients. However, the inverse problem-solving aspect of these models presents a substantial challenge, either due to the high computational requirements of model-based approaches or the limited robustness of deep learning (DL) methods. We propose a novel framework that leverages the unique strengths of both approaches in a synergistic manner. Our method incorporates a DL ensemble for initial parameter estimation, facilitating efficient downstream evolutionary sampling initialized with this DL-based prior. We showcase the effectiveness of integrating a rapid deep-learning algorithm with a high-precision evolution strategy in estimating brain tumor cell concentrations from magnetic resonance images. The DL-Prior plays a pivotal role, significantly constraining the effective sampling-parameter space. This reduction results in a fivefold convergence acceleration and a Dice-score of 95%
- [844] arXiv:2403.04510 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Where does In-context Translation Happen in Large Language ModelsComments: 19 pages. Under ReviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Self-supervised large language models have demonstrated the ability to perform Machine Translation (MT) via in-context learning, but little is known about where the model performs the task with respect to prompt instructions and demonstration examples. In this work, we attempt to characterize the region where large language models transition from in-context learners to translation models. Through a series of layer-wise context-masking experiments on \textsc{GPTNeo2.7B}, \textsc{Bloom3B}, \textsc{Llama7b} and \textsc{Llama7b-chat}, we demonstrate evidence of a "task recognition" point where the translation task is encoded into the input representations and attention to context is no longer necessary. We further observe correspondence between the low performance when masking out entire layers, and the task recognition layers. Taking advantage of this redundancy results in 45\% computational savings when prompting with 5 examples, and task recognition achieved at layer 14 / 32. Our layer-wise fine-tuning experiments indicate that the most effective layers for MT fine-tuning are the layers critical to task recognition.
- [845] arXiv:2403.04523 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: T-TAME: Trainable Attention Mechanism for Explaining Convolutional Networks and Vision TransformersComments: Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: The development and adoption of Vision Transformers and other deep-learning architectures for image classification tasks has been rapid. However, the "black box" nature of neural networks is a barrier to adoption in applications where explainability is essential. While some techniques for generating explanations have been proposed, primarily for Convolutional Neural Networks, adapting such techniques to the new paradigm of Vision Transformers is non-trivial. This paper presents T-TAME, Transformer-compatible Trainable Attention Mechanism for Explanations, a general methodology for explaining deep neural networks used in image classification tasks. The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network, using a streamlined training approach. After training, explanation maps can be computed in a single forward pass; these explanation maps are comparable to or outperform the outputs of computationally expensive perturbation-based explainability techniques, achieving SOTA performance. We apply T-TAME to three popular deep learning classifier architectures, VGG-16, ResNet-50, and ViT-B-16, trained on the ImageNet dataset, and we demonstrate improvements over existing state-of-the-art explainability methods. A detailed analysis of the results and an ablation study provide insights into how the T-TAME design choices affect the quality of the generated explanation maps.
- [846] arXiv:2403.04526 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Hyperspectral unmixing for Raman spectroscopy via physics-constrained autoencodersDimitar Georgiev , Álvaro Fernández-Galiana , Simon Vilms Pedersen , Georgios Papadopoulos , Ruoxiao Xie , Molly M. Stevens , Mauricio BarahonaSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Raman spectroscopy is widely used across scientific domains to characterize the chemical composition of samples in a non-destructive, label-free manner. Many applications entail the unmixing of signals from mixtures of molecular species to identify the individual components present and their proportions, yet conventional methods for chemometrics often struggle with complex mixture scenarios encountered in practice. Here, we develop hyperspectral unmixing algorithms based on autoencoder neural networks, and we systematically validate them using both synthetic and experimental benchmark datasets created in-house. Our results demonstrate that unmixing autoencoders provide improved accuracy, robustness and efficiency compared to standard unmixing methods. We also showcase the applicability of autoencoders to complex biological settings by showing improved biochemical characterization of volumetric Raman imaging data from a monocytic cell.
- [847] arXiv:2403.04529 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Enhancing Data Quality in Federated Fine-Tuning of Foundation ModelsComments: Accepted at ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models (DPFM)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among multiple specialized and high-quality private domain data sources. However, the challenge of training models locally without sharing private data presents numerous obstacles in data quality control. To tackle this issue, we propose a data quality control pipeline for federated fine-tuning of foundation models. This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard, aiming for improved global performance. Our experiments show that the proposed quality control pipeline facilitates the effectiveness and reliability of the model training, leading to better performance.
- [848] arXiv:2403.04547 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?Ibrahim Alabdulmohsin , Xiao Wang , Andreas Steiner , Priya Goyal , Alexander D'Amour , Xiaohua ZhaiComments: 32 pages, 20 figures, 7 tablesJournal-ref: ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.
- [849] arXiv:2403.04558 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Reducing self-supervised learning complexity improves weakly-supervised classification performance in computational pathologyComments: Submitted to MICCAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Deep Learning models have been successfully utilized to extract clinically actionable insights from routinely available histology data. Generally, these models require annotations performed by clinicians, which are scarce and costly to generate. The emergence of self-supervised learning (SSL) methods remove this barrier, allowing for large-scale analyses on non-annotated data. However, recent SSL approaches apply increasingly expansive model architectures and larger datasets, causing the rapid escalation of data volumes, hardware prerequisites, and overall expenses, limiting access to these resources to few institutions. Therefore, we investigated the complexity of contrastive SSL in computational pathology in relation to classification performance with the utilization of consumer-grade hardware. Specifically, we analyzed the effects of adaptations in data volume, architecture, and algorithms on downstream classification tasks, emphasizing their impact on computational resources. We trained breast cancer foundation models on a large public patient cohort and validated them on various downstream classification tasks in a weakly supervised manner on two external public patient cohorts. Our experiments demonstrate that we can improve downstream classification performance whilst reducing SSL training duration by 90%. In summary, we propose a set of adaptations which enable the utilization of SSL in computational pathology in non-resource abundant environments.
- [850] arXiv:2403.04612 (cross-list from eess.IV) [ pdf , ps , other ]
-
Title: A Domain Translation Framework with an Adversarial Denoising Diffusion Model to Generate Synthetic Datasets of Echocardiography ImagesSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Currently, medical image domain translation operations show a high demand from researchers and clinicians. Amongst other capabilities, this task allows the generation of new medical images with sufficiently high image quality, making them clinically relevant. Deep Learning (DL) architectures, most specifically deep generative models, are widely used to generate and translate images from one domain to another. The proposed framework relies on an adversarial Denoising Diffusion Model (DDM) to synthesize echocardiography images and perform domain translation. Contrary to Generative Adversarial Networks (GANs), DDMs are able to generate high quality image samples with a large diversity. If a DDM is combined with a GAN, this ability to generate new data is completed at an even faster sampling time. In this work we trained an adversarial DDM combined with a GAN to learn the reverse denoising process, relying on a guide image, making sure relevant anatomical structures of each echocardiography image were kept and represented on the generated image samples. For several domain translation operations, the results verified that such generative model was able to synthesize high quality image samples: MSE: 11.50 +/- 3.69, PSNR (dB): 30.48 +/- 0.09, SSIM: 0.47 +/- 0.03. The proposed method showed high generalization ability, introducing a framework to create echocardiography images suitable to be used for clinical research purposes.
- [851] arXiv:2403.04629 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Explaining Bayesian Optimization by Shapley Values Facilitates Human-AI CollaborationJulian Rodemann , Federico Croppi , Philipp Arens , Yusuf Sale , Julia Herbinger , Bernd Bischl , Eyke Hüllermeier , Thomas Augustin , Conor J. Walsh , Giuseppe CasalicchioComments: Preprint. Copyright by the authors. 19 pages, 24 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Machine Learning (stat.ML)
Abstract: Bayesian optimization (BO) with Gaussian processes (GP) has become an indispensable algorithm for black box optimization problems. Not without a dash of irony, BO is often considered a black box itself, lacking ways to provide reasons as to why certain parameters are proposed to be evaluated. This is particularly relevant in human-in-the-loop applications of BO, such as in robotics. We address this issue by proposing ShapleyBO, a framework for interpreting BO's proposals by game-theoretic Shapley values.They quantify each parameter's contribution to BO's acquisition function. Exploiting the linearity of Shapley values, we are further able to identify how strongly each parameter drives BO's exploration and exploitation for additive acquisition functions like the confidence bound. We also show that ShapleyBO can disentangle the contributions to exploration into those that explore aleatoric and epistemic uncertainty. Moreover, our method gives rise to a ShapleyBO-assisted human machine interface (HMI), allowing users to interfere with BO in case proposals do not align with human reasoning. We demonstrate this HMI's benefits for the use case of personalizing wearable robotic devices (assistive back exosuits) by human-in-the-loop BO. Results suggest human-BO teams with access to ShapleyBO can achieve lower regret than teams without.
- [852] arXiv:2403.04634 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Pix2Gif: Motion-Guided Diffusion for GIF GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We tackle this problem differently by formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in teaser fig. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module to spatially transform the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss to ensure the transformed feature map remains within the same space as the target image, ensuring content consistency and coherence. In preparation for the model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model -- it not only captures the semantic prompt from text but also the spatial ones from motion guidance. We train all our models using a single node of 16xV100 GPUs. Code, dataset and models are made public at: this https URL .
- [853] arXiv:2403.04650 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Context-Based Multimodal FusionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The fusion models, which effectively combine information from different sources, are widely used in solving multimodal tasks. However, they have significant limitations related to aligning data distributions across different modalities. This challenge can lead to inconsistencies and difficulties in learning robust representations. Alignment models, while specifically addressing this issue, often require training "from scratch" with large datasets to achieve optimal results, which can be costly in terms of resources and time. To overcome these limitations, we propose an innovative model called Context-Based Multimodal Fusion (CBMF), which combines both modality fusion and data distribution alignment. In CBMF, each modality is represented by a specific context vector, fused with the embedding of each modality. This enables the use of large pre-trained models that can be frozen, reducing the computational and training data requirements. Additionally, the network learns to differentiate embeddings of different modalities through fusion with context and aligns data distributions using a contrastive approach for self-supervised learning. Thus, CBMF offers an effective and economical solution for solving complex multimodal tasks.
- [854] arXiv:2403.04652 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Yi: Open Foundation Models by 01.AI01.AI : Alex Young , Bei Chen , Chao Li , Chengen Huang , Ge Zhang , Guanwei Zhang , Heng Li , Jiangcheng Zhu , Jianqun Chen , Jing Chang , Kaidong Yu , Peng Liu , Qiang Liu , Shawn Yue , Senbin Yang , Shiming Yang , Tao Yu , Wen Xie , Wenhao Huang , Xiaohui Hu , Xiaoyi Ren , Xinyao Niu , Pengcheng Nie , Yuchi Xu , Yudong Liu , Yue Wang , Yuxuan Cai , Zhenyu Gu , Zhiyuan Liu , Zonghong DaiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.
- [855] arXiv:2403.04690 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock LevelComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems respectively.
- [856] arXiv:2403.04696 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Fact-Checking the Output of Large Language Models via Token-Level Uncertainty QuantificationEkaterina Fadeeva , Aleksandr Rubashevskii , Artem Shelmanov , Sergey Petrakov , Haonan Li , Hamdy Mubarak , Evgenii Tsymbalov , Gleb Kuzmin , Alexander Panchenko , Timothy Baldwin , Preslav Nakov , Maxim PanovSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factual, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for six different LLMs and three languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.
- [857] arXiv:2403.04697 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit DetectorsComments: 19 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Facial Action Units (AU) is a vital concept in the realm of affective computing, and AU detection has always been a hot research topic. Existing methods suffer from overfitting issues due to the utilization of a large number of learnable parameters on scarce AU-annotated datasets or heavy reliance on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges, whereas its existing methods lack design for AU characteristics. Therefore, we innovatively investigate PETL paradigm to AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE specific to a certain AU with minimal learnable parameters first integrates personalized multi-scale and correlation knowledge. Then the MoKE collaborates with other MoKEs in the expert group to obtain aggregated information and inject it into the frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which can encourage the model to focus more on activated AUs, differentiate the difficulty of unactivated AUs, and discard potential mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data efficiency, and micro-expression domain, demonstrate AUFormer's state-of-the-art performance and robust generalization abilities without relying on additional relevant data. The code for AUFormer is available at this https URL .
- [858] arXiv:2403.04701 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional ChangesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either lack in offering control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiment to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks. Code this https URL
- [859] arXiv:2403.04706 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Common 7B Language Models Already Possess Strong Math CapabilitiesChen Li , Weiqi Wang , Jingcheng Hu , Yixuan Wei , Nanning Zheng , Han Hu , Zheng Zhang , Houwen PengSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Mathematical capabilities were previously believed to emerge in common language models only at a very large scale or require extensive math-related pre-training. This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities, as evidenced by its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks, respectively, when selecting the best response from 256 random generations. The primary issue with the current base model is the difficulty in consistently eliciting its inherent mathematical capabilities. Notably, the accuracy for the first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks, respectively. We find that simply scaling up the SFT data can significantly enhance the reliability of generating correct answers. However, the potential for extensive scaling is constrained by the scarcity of publicly available math questions. To overcome this limitation, we employ synthetic data, which proves to be nearly as effective as real data and shows no clear saturation when scaled up to approximately one million samples. This straightforward approach achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B models, surpassing previous models by 14.2% and 20.8%, respectively. We also provide insights into scaling behaviors across different reasoning complexities and error types.
- [860] arXiv:2403.04746 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LLMs in the Imaginarium: Tool Learning through Simulated Trial and ErrorComments: Code and data available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
- [861] arXiv:2403.04747 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GNN-VPA: A Variance-Preserving Aggregation Strategy for Graph Neural NetworksLisa Schneckenreiter , Richard Freinschlag , Florian Sestak , Johannes Brandstetter , Günter Klambauer , Andreas MayrComments: Accepted at ICLR 2024 (Tiny Papers Track)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Graph neural networks (GNNs), and especially message-passing neural networks, excel in various domains such as physics, drug discovery, and molecular modeling. The expressivity of GNNs with respect to their ability to discriminate non-isomorphic graphs critically depends on the functions employed for message aggregation and graph-level readout. By applying signal propagation theory, we propose a variance-preserving aggregation function (VPA) that maintains expressivity, but yields improved forward and backward dynamics. Experiments demonstrate that VPA leads to increased predictive performance for popular GNN architectures as well as improved learning dynamics. Our results could pave the way towards normalizer-free or self-normalizing GNNs.
- [862] arXiv:2403.04758 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: KnowledgeVIS: Interpreting Language Models by Comparing Fill-in-the-Blank PromptsComments: Accepted to IEEE TVCG. 20 pages, 10 figures, 1 table. For a demo video, see this https URL . For a live demo, visit this https URL . The source code is available at this https URLSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Recent growth in the popularity of large language models has led to their increased usage for summarizing, predicting, and generating text, making it vital to help researchers and engineers understand how and why they work. We present KnowledgeVis, a human-in-the-loop visual analytics system for interpreting language models using fill-in-the-blank sentences as prompts. By comparing predictions between sentences, KnowledgeVis reveals learned associations that intuitively connect what language models learn during training to natural language tasks downstream, helping users create and test multiple prompt variations, analyze predicted words using a novel semantic clustering technique, and discover insights using interactive visualizations. Collectively, these visualizations help users identify the likelihood and uniqueness of individual predictions, compare sets of predictions between prompts, and summarize patterns and relationships between predictions across all prompts. We demonstrate the capabilities of KnowledgeVis with feedback from six NLP experts as well as three different use cases: (1) probing biomedical knowledge in two domain-adapted models; and (2) evaluating harmful identity stereotypes and (3) discovering facts and relationships between three general-purpose models.
- [863] arXiv:2403.04760 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: iScore: Visual Analytics for Interpreting How Language Models Automatically Score SummariesComments: Accepted to IUI 2024. 16 pages, 5 figures, 1 table. For a demo video, see this https URL . For a live demo, visit this https URL . The source code is available at this https URLSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: The recent explosion in popularity of large language models (LLMs) has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibits transparency and impedes trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM's score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.
- [864] arXiv:2403.04769 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Using Hallucinations to Bypass GPT4's FilterSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) are initially trained on vast amounts of data, then fine-tuned using reinforcement learning from human feedback (RLHF); this also serves to teach the LLM to provide appropriate and safe responses. In this paper, we present a novel method to manipulate the fine-tuned version into reverting to its pre-RLHF behavior, effectively erasing the model's filters; the exploit currently works for GPT4, Claude Sonnet, and (to some extent) for Inflection-2.5. Unlike other jailbreaks (for example, the popular "Do Anything Now" (DAN) ), our method does not rely on instructing the LLM to override its RLHF policy; hence, simply modifying the RLHF process is unlikely to address it. Instead, we induce a hallucination involving reversed text during which the model reverts to a word bucket, effectively pausing the model's filter. We believe that our exploit presents a fundamental vulnerability in LLMs currently unaddressed, as well as an opportunity to better understand the inner workings of LLMs during hallucinations.
- [865] arXiv:2403.04775 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: Superposition with Delayed UnificationComments: 16 pages, 0 figures, 1 tableJournal-ref: International Conference on Automated Deduction (CADE) 2023. LNAI volume 14132, 2023, pp. 23-40Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI)
Abstract: Classically, in saturation-based proof systems, unification has been considered atomic. However, it is also possible to move unification to the calculus level, turning the steps of the unification algorithm into inferences. For calculi that rely on unification procedures returning large or even infinite sets of unifiers, integrating unification into the calculus is an attractive method of dovetailing unification and inference. This applies, for example, to AC-superposition and higher-order superposition. We show that first-order superposition remains complete when moving unification rules to the calculus level. We discuss some of the benefits this has even for standard first-order superposition and provide an experimental evaluation.
- [866] arXiv:2403.04780 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph MiningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Graphs with abundant attributes are essential in modeling interconnected entities and improving predictions in various real-world applications. Traditional Graph Neural Networks (GNNs), which are commonly used for modeling attributed graphs, need to be re-trained every time when applied to different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced a new paradigm in natural language processing, the generative potential of LLMs in graph mining remains largely under-explored. To this end, we propose a novel framework MuseGraph, which seamlessly integrates the strengths of GNNs and LLMs and facilitates a more effective and generic approach for graph mining across different tasks and datasets. Specifically, we first introduce a compact graph description via the proposed adaptive input generation to encapsulate key information from the graph under the constraints of language token limitations. Then, we propose a diverse instruction generation mechanism, which distills the reasoning capabilities from LLMs (e.g., GPT-4) to create task-specific Chain-of-Thought-based instruction packages for different graph tasks. Finally, we propose a graph-aware instruction tuning with a dynamic instruction package allocation strategy across tasks and datasets, ensuring the effectiveness and generalization of the training process. Our experimental results demonstrate significant improvements in different graph tasks, showcasing the potential of our MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while keeping the generation powers of LLMs.
- [867] arXiv:2403.04782 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Survey on Temporal Knowledge Graph: Representation Learning and ApplicationsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge graphs have garnered significant research attention and are widely used to enhance downstream applications. However, most current studies mainly focus on static knowledge graphs, whose facts do not change with time, and disregard their dynamic evolution over time. As a result, temporal knowledge graphs have attracted more attention because a large amount of structured knowledge exists only within a specific period. Knowledge graph representation learning aims to learn low-dimensional vector embeddings for entities and relations in a knowledge graph. The representation learning of temporal knowledge graphs incorporates time information into the standard knowledge graph framework and can model the dynamics of entities and relations over time. In this paper, we conduct a comprehensive survey of temporal knowledge graph representation learning and its applications. We begin with an introduction to the definitions, datasets, and evaluation metrics for temporal knowledge graph representation learning. Next, we propose a taxonomy based on the core technologies of temporal knowledge graph representation learning methods, and provide an in-depth analysis of different methods in each category. Finally, we present various downstream applications related to the temporal knowledge graphs. In the end, we conclude the paper and have an outlook on the future research directions in this area.
- [868] arXiv:2403.04785 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large Language Multimodal Models for 5-Year Chronic Disease Cohort Prediction Using EHR DataJun-En Ding , Phan Nguyen Minh Thao , Wen-Chih Peng , Jian-Zhe Wang , Chun-Cheng Chug , Min-Chen Hsieh , Yun-Chien Tseng , Ling Chen , Dongsheng Luo , Chi-Te Wang , Pei-fu Chen , Feng Liu , Fang-Ming HungSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Chronic diseases such as diabetes are the leading causes of morbidity and mortality worldwide. Numerous research studies have been attempted with various deep learning models in diagnosis. However, most previous studies had certain limitations, including using publicly available datasets (e.g. MIMIC), and imbalanced data. In this study, we collected five-year electronic health records (EHRs) from the Taiwan hospital database, including 1,420,596 clinical notes, 387,392 laboratory test results, and more than 1,505 laboratory test items, focusing on research pre-training large language models. We proposed a novel Large Language Multimodal Models (LLMMs) framework incorporating multimodal data from clinical notes and laboratory test results for the prediction of chronic disease risk. Our method combined a text embedding encoder and multi-head attention layer to learn laboratory test values, utilizing a deep neural network (DNN) module to merge blood features with chronic disease semantics into a latent space. In our experiments, we observe that clinicalBERT and PubMed-BERT, when combined with attention fusion, can achieve an accuracy of 73% in multiclass chronic diseases and diabetes prediction. By transforming laboratory test values into textual descriptions and employing the Flan T-5 model, we achieved a 76% Area Under the ROC Curve (AUROC), demonstrating the effectiveness of leveraging numerical text data for training and inference in language models. This approach significantly improves the accuracy of early-stage diabetes prediction.
- [869] arXiv:2403.04787 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Ever-Evolving Memory by Blending and Refining the PastComments: 17 pages, 4 figures, 7 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: For a human-like chatbot, constructing a long-term memory is crucial. However, current large language models often lack this capability, leading to instances of missing important user information or redundantly asking for the same information, thereby diminishing conversation quality. To effectively construct memory, it is crucial to seamlessly connect past and present information, while also possessing the ability to forget obstructive information. To address these challenges, we propose CREEM, a novel memory system for long-term conversation. Improving upon existing approaches that construct memory based solely on current sessions, CREEM blends past memories during memory formation. Additionally, we introduce a refining process to handle redundant or outdated information. Unlike traditional paradigms, we view responding and memory construction as inseparable tasks. The blending process, which creates new memories, also serves as a reasoning step for response generation by informing the connection between past and present. Through evaluation, we demonstrate that CREEM enhances both memory and response qualities in multi-session personalized dialogues.
- [870] arXiv:2403.04789 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: TopicDiff: A Topic-enriched Diffusion Approach for Multimodal Conversational Emotion DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Multimodal Conversational Emotion (MCE) detection, generally spanning across the acoustic, vision and language modalities, has attracted increasing interest in the multimedia community. Previous studies predominantly focus on learning contextual information in conversations with only a few considering the topic information in single language modality, while always neglecting the acoustic and vision topic information. On this basis, we propose a model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in MCE tasks. Particularly, we integrate the diffusion model into neural topic model to alleviate the diversity deficiency problem of neural topic model in capturing topic information. Detailed evaluations demonstrate the significant improvements of TopicDiff over the state-of-the-art MCE baselines, justifying the importance of multimodal topic information to MCE and the effectiveness of TopicDiff in capturing such information. Furthermore, we observe an interesting finding that the topic information in acoustic and vision is more discriminative and robust compared to the language.
- [871] arXiv:2403.04790 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Online Training of Large Language Models: Learn while chattingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models(LLMs) have dramatically revolutionized the field of Natural Language Processing(NLP), offering remarkable capabilities that have garnered widespread usage. However, existing interaction paradigms between LLMs and users are constrained by either inflexibility, limitations in customization, or a lack of persistent learning. This inflexibility is particularly evident as users, especially those without programming skills, have restricted avenues to enhance or personalize the model. Existing frameworks further complicate the model training and deployment process due to their computational inefficiencies and lack of user-friendly interfaces. To overcome these challenges, this paper introduces a novel interaction paradigm-'Online Training using External Interactions'-that merges the benefits of persistent, real-time model updates with the flexibility for individual customization through external interactions such as AI agents or online/offline knowledge bases.
- [872] arXiv:2403.04793 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time SeriesZhipeng Ma , Marco Kemmerling , Daniel Buschmann , Chrismarie Enslin , Daniel Lütticke , Robert H. SchmittJournal-ref: Symmetry 2023, 15(5), 982Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Methodology (stat.ME)
Abstract: Causal inference is a fundamental research topic for discovering the cause-effect relationships in many disciplines. However, not all algorithms are equally well-suited for a given dataset. For instance, some approaches may only be able to identify linear relationships, while others are applicable for non-linearities. Algorithms further vary in their sensitivity to noise and their ability to infer causal information from coupled vs. non-coupled time series. Therefore, different algorithms often generate different causal relationships for the same input. To achieve a more robust causal inference result, this publication proposes a novel data-driven two-phase multi-split causal ensemble model to combine the strengths of different causality base algorithms. In comparison to existing approaches, the proposed ensemble method reduces the influence of noise through a data partitioning scheme in the first phase. To achieve this, the data are initially divided into several partitions and the base algorithms are applied to each partition. Subsequently, Gaussian mixture models are used to identify the causal relationships derived from the different partitions that are likely to be valid. In the second phase, the identified relationships from each base algorithm are then merged based on three combination rules. The proposed ensemble approach is evaluated using multiple metrics, among them a newly developed evaluation index for causal ensemble approaches. We perform experiments using three synthetic datasets with different volumes and complexity, which are specifically designed to test causality detection methods under different circumstances while knowing the ground truth causal relationships. In these experiments, our causality ensemble outperforms each of its base algorithms. In practical applications, the use of the proposed method could hence lead to more robust and reliable causality results.
- [873] arXiv:2403.04795 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Large Language Models in Fire Engineering: An Examination of Technical Questions Against Domain KnowledgeSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This communication presents preliminary findings from comparing two recent chatbots, OpenAI's ChatGPT and Google's Bard, in the context of fire engineering by evaluating their responses in handling fire safety related queries. A diverse range of fire engineering questions and scenarios were created and examined, including structural fire design, fire prevention strategies, evacuation, building code compliance, and fire suppression systems (some of which resemble those commonly present in the Fire Protection exam (FPE)). The results reveal some key differences in the performance of the chatbots, with ChatGPT demonstrating a relatively superior performance. Then, this communication highlights the potential for chatbot technology to revolutionize fire engineering practices by providing instant access to critical information while outlining areas for further improvement and research. Evidently, and when it matures, this technology will likely be elemental to our engineers' practice and education.
- [874] arXiv:2403.04799 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: AI Literacy in Low-Resource Languages:Insights from creating AI in Yoruba videosComments: Accepted at the Global AI Cultures Workshop, ICLR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: To effectively navigate the AI revolution, AI literacy is crucial. However, content predominantly exists in dominant languages, creating a gap for low-resource languages like Yoruba (41 million native speakers). This case study explores bridging this gap by creating and distributing AI videos in Yoruba.The project developed 26 videos covering foundational, intermediate, and advanced AI concepts, leveraging storytelling and accessible explanations. These videos were created using a cost-effective methodology and distributed across YouTube, LinkedIn, and Twitter, reaching an estimated global audience of 22 countries. Analysis of YouTube reveals insights into viewing patterns, with the 25-44 age group contributing the most views. Notably, over half of the traffic originated from external sources, highlighting the potential of cross-platform promotion.This study demonstrates the feasibility and impact of creating AI literacy content in low-resource languages. It emphasizes that accurate interpretation requires both technical expertise in AI and fluency in the target language. This work contributes a replicable methodology, a 22-word Yoruba AI vocabulary, and data-driven insights into audience demographics and acquisition channel
- [875] arXiv:2403.04803 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Enhancing Security in Federated Learning through Adaptive Consensus-Based Model Update ValidationSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract: This paper introduces an advanced approach for fortifying Federated Learning (FL) systems against label-flipping attacks. We propose a simplified consensus-based verification process integrated with an adaptive thresholding mechanism. This dynamic thresholding is designed to adjust based on the evolving landscape of model updates, offering a refined layer of anomaly detection that aligns with the real-time needs of distributed learning environments. Our method necessitates a majority consensus among participating clients to validate updates, ensuring that only vetted and consensual modifications are applied to the global model. The efficacy of our approach is validated through experiments on two benchmark datasets in deep learning, CIFAR-10 and MNIST. Our results indicate a significant mitigation of label-flipping attacks, bolstering the FL system's resilience. This method transcends conventional techniques that depend on anomaly detection or statistical validation by incorporating a verification layer reminiscent of blockchain's participatory validation without the associated cryptographic overhead. The innovation of our approach rests in striking an optimal balance between heightened security measures and the inherent limitations of FL systems, such as computational efficiency and data privacy. Implementing a consensus mechanism specifically tailored for FL environments paves the way for more secure, robust, and trustworthy distributed machine learning applications, where safeguarding data integrity and model robustness is critical.
- [876] arXiv:2403.04807 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Mathematics of Neural Networks (Lecture Notes Graduate Course)Comments: Lecture notes of the graduate course 2MMA80 Mathematics of Neural Networks as thought at the Eindhoven University of Technology from 2021 to 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: These are the lecture notes that accompanied the course of the same name that I taught at the Eindhoven University of Technology from 2021 to 2023. The course is intended as an introduction to neural networks for mathematics students at the graduate level and aims to make mathematics students interested in further researching neural networks. It consists of two parts: first a general introduction to deep learning that focuses on introducing the field in a formal mathematical way. The second part provides an introduction to the theory of Lie groups and homogeneous spaces and how it can be applied to design neural networks with desirable geometric equivariances. The lecture notes were made to be as self-contained as possible so as to accessible for any student with a moderate mathematics background. The course also included coding tutorials and assignments in the form of a set of Jupyter notebooks that are publicly available at this https URL .
- [877] arXiv:2403.04810 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Restricted Bayesian Neural NetworkSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Modern deep learning tools are remarkably effective in addressing intricate problems. However, their operation as black-box models introduces increased uncertainty in predictions. Additionally, they contend with various challenges, including the need for substantial storage space in large networks, issues of overfitting, underfitting, vanishing gradients, and more. This study explores the concept of Bayesian Neural Networks, presenting a novel architecture designed to significantly alleviate the storage space complexity of a network. Furthermore, we introduce an algorithm adept at efficiently handling uncertainties, ensuring robust convergence values without becoming trapped in local optima, particularly when the objective function lacks perfect convexity.
- [878] arXiv:2403.04814 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle TasksSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Abstract: We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference using LLMs. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at this https URL , and the leaderboard is available at this https URL .
- [879] arXiv:2403.04894 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ConstitutionalExperts: Training a Mixture of Principle-based PromptsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) are highly capable at a variety of tasks given the right prompt, but writing one is still a difficult and tedious process. In this work, we introduce ConstitutionalExperts, a method for learning a prompt consisting of constitutional principles (i.e. rules), given a training dataset. Unlike prior methods that optimize the prompt as a single entity, our method incrementally improves the prompt by surgically editing individual principles. We also show that we can improve overall performance by learning unique prompts for different semantic regions of the training data and using a mixture-of-experts (MoE) architecture to route inputs at inference time. We compare our method to other state of the art prompt-optimization techniques across six benchmark datasets. We also investigate whether MoE improves these other techniques. Our results suggest that ConstitutionalExperts outperforms other prompt optimization techniques by 10.9% (F1) and that mixture-of-experts improves all techniques, suggesting its broad applicability.
- [880] arXiv:2403.04899 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Towards Scene Graph AnticipationComments: Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Spatio-temporal scene graphs represent interactions in a video by decomposing scenes into individual objects and their pair-wise temporal relationships. Long-term anticipation of the fine-grained pair-wise relationships between objects is a challenging problem. To this end, we introduce the task of Scene Graph Anticipation (SGA). We adapt state-of-the-art scene graph generation methods as baselines to anticipate future pair-wise relationships between objects and propose a novel approach SceneSayer. In SceneSayer, we leverage object-centric representations of relationships to reason about the observed video frames and model the evolution of relationships between objects. We take a continuous time perspective and model the latent dynamics of the evolution of object interactions using concepts of NeuralODE and NeuralSDE, respectively. We infer representations of future relationships by solving an Ordinary Differential Equation and a Stochastic Differential Equation, respectively. Extensive experimentation on the Action Genome dataset validates the efficacy of the proposed methods.
- [881] arXiv:2403.04917 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Mixed-Integer Conic Program for the Moving-Target Traveling Salesman Problem based on a Graph of Convex SetsComments: 7 pages, 4 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Abstract: This paper introduces a new formulation that finds the optimum for the Moving-Target Traveling Salesman Problem (MT-TSP), which seeks to find a shortest path for an agent, that starts at a depot, visits a set of moving targets exactly once within their assigned time-windows, and returns to the depot. The formulation relies on the key idea that when the targets move along lines, their trajectories become convex sets within the space-time coordinate system. The problem then reduces to finding the shortest path within a graph of convex sets, subject to some speed constraints. We compare our formulation with the current state-of-the-art Mixed Integer Conic Program (MICP) solver for the MT-TSP. The experimental results show that our formulation outperforms the MICP for instances with up to 20 targets, with up to two orders of magnitude reduction in runtime, and up to a 60\% tighter optimality gap. We also show that the solution cost from the convex relaxation of our formulation provides significantly tighter lower bounds for the MT-TSP than the ones from the MICP.
- [882] arXiv:2403.04929 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Markov Property of Neural Algorithmic Reasoning: Analyses and MethodsComments: To appear at ICLR 2024 (Spotlight paper). 17 pages, 10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Neural algorithmic reasoning is an emerging research direction that endows neural networks with the ability to mimic algorithmic executions step-by-step. A common paradigm in existing designs involves the use of historical embeddings in predicting the results of future execution steps. Our observation in this work is that such historical dependence intrinsically contradicts the Markov nature of algorithmic reasoning tasks. Based on this motivation, we present our ForgetNet, which does not use historical embeddings and thus is consistent with the Markov nature of the tasks. To address challenges in training ForgetNet at early stages, we further introduce G-ForgetNet, which uses a gating mechanism to allow for the selective integration of historical embeddings. Such an enhanced capability provides valuable computational pathways during the model's early training phase. Our extensive experiments, based on the CLRS-30 algorithmic reasoning benchmark, demonstrate that both ForgetNet and G-ForgetNet achieve better generalization capability than existing methods. Furthermore, we investigate the behavior of the gating mechanism, highlighting its degree of alignment with our intuitions and its effectiveness for robust performance.
- [883] arXiv:2403.04934 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: LeTac-MPC: Learning Model Predictive Control for Tactile-reactive GraspingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Grasping is a crucial task in robotics, necessitating tactile feedback and reactive grasping adjustments for robust grasping of objects under various conditions and with differing physical properties. In this paper, we introduce LeTac-MPC, a learning-based model predictive control (MPC) for tactile-reactive grasping. Our approach enables the gripper grasp objects with different physical properties on dynamic and force-interactive tasks. We utilize a vision-based tactile sensor, GelSight, which is capable of perceiving high-resolution tactile feedback that contains the information of physical properties and states of the grasped object. LeTac-MPC incorporates a differentiable MPC layer designed to model the embeddings extracted by a neural network (NN) from tactile feedback. This design facilitates convergent and robust grasping control at a frequency of 25 Hz. We propose a fully automated data collection pipeline and collect a dataset only using standardized blocks with different physical properties. However, our trained controller can generalize to daily objects with different sizes, shapes, materials, and textures. Experimental results demonstrate the effectiveness and robustness of the proposed approach. We compare LeTac-MPC with two purely model-based tactile-reactive controllers (MPC and PD) and open-loop grasping. Our results show that LeTac-MPC has the best performance on dynamic and force-interactive tasks and the best generalization ability. We release our code and dataset at this https URL .
- [884] arXiv:2403.04940 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: A spatiotemporal style transfer algorithm for dynamic visual stimulus generationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract: Understanding how visual information is encoded in biological and artificial systems often requires vision scientists to generate appropriate stimuli to test specific hypotheses. Although deep neural network models have revolutionized the field of image generation with methods such as image style transfer, available methods for video generation are scarce. Here, we introduce the Spatiotemporal Style Transfer (STST) algorithm, a dynamic visual stimulus generation framework that allows powerful manipulation and synthesis of video stimuli for vision research. It is based on a two-stream deep neural network model that factorizes spatial and temporal features to generate dynamic visual stimuli whose model layer activations are matched to those of input videos. As an example, we show that our algorithm enables the generation of model metamers, dynamic stimuli whose layer activations within our two-stream model are matched to those of natural videos. We show that these generated stimuli match the low-level spatiotemporal features of their natural counterparts but lack their high-level semantic features, making it a powerful paradigm to study object recognition. Late layer activations in deep vision models exhibited a lower similarity between natural and metameric stimuli compared to early layers, confirming the lack of high-level information in the generated stimuli. Finally, we use our generated stimuli to probe the representational capabilities of predictive coding deep networks. These results showcase potential applications of our algorithm as a versatile tool for dynamic stimulus generation in vision science.
- [885] arXiv:2403.04954 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Fooling Neural Networks for Motion Forecasting via Adversarial AttacksComments: 11 pages, 8 figures, VISSAP 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Human motion prediction is still an open problem, which is extremely important for autonomous driving and safety applications. Although there are great advances in this area, the widely studied topic of adversarial attacks has not been applied to multi-regression models such as GCNs and MLP-based architectures in human motion prediction. This work intends to reduce this gap using extensive quantitative and qualitative experiments in state-of-the-art architectures similar to the initial stages of adversarial attacks in image classification. The results suggest that models are susceptible to attacks even on low levels of perturbation. We also show experiments with 3D transformations that affect the model performance, in particular, we show that most models are sensitive to simple rotations and translations which do not alter joint distances. We conclude that similar to earlier CNN models, motion forecasting tasks are susceptible to small perturbations and simple 3D transformations.
- [886] arXiv:2403.04960 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: SecGPT: An Execution Isolation Architecture for LLM-Based SystemsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) extended as systems, such as ChatGPT, have begun supporting third-party applications. These LLM apps leverage the de facto natural language-based automated execution paradigm of LLMs: that is, apps and their interactions are defined in natural language, provided access to user data, and allowed to freely interact with each other and the system. These LLM app ecosystems resemble the settings of earlier computing platforms, where there was insufficient isolation between apps and the system. Because third-party apps may not be trustworthy, and exacerbated by the imprecision of the natural language interfaces, the current designs pose security and privacy risks for users. In this paper, we propose SecGPT, an architecture for LLM-based systems that aims to mitigate the security and privacy issues that arise with the execution of third-party apps. SecGPT's key idea is to isolate the execution of apps and more precisely mediate their interactions outside of their isolated environments. We evaluate SecGPT against a number of case study attacks and demonstrate that it protects against many security, privacy, and safety issues that exist in non-isolated LLM-based systems. The performance overhead incurred by SecGPT to improve security is under 0.3x for three-quarters of the tested queries. To foster follow-up research, we release SecGPT's source code at this https URL .
- [887] arXiv:2403.04963 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: An In-depth Evaluation of GPT-4 in Sentence Simplification with Error-based Human AssessmentSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Sentence simplification, which rewrites a sentence to be easier to read and understand, is a promising technique to help people with various reading difficulties. With the rise of advanced large language models (LLMs), evaluating their performance in sentence simplification has become imperative. Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the GPT-4's simplification capabilities. Results show that GPT-4 generally generates fewer erroneous simplification outputs compared to the current state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's struggles with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that while these metrics are effective for significant quality differences, they lack sufficient sensitivity to assess the overall high-quality simplification by GPT-4.
- [888] arXiv:2403.04965 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The demand for stereo images increases as manufacturers launch more XR devices. To meet this demand, we introduce StereoDiffusion, a method that, unlike traditional inpainting pipelines, is trainning free, remarkably straightforward to use, and it seamlessly integrates into the original Stable Diffusion model. Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs, without the need for fine-tuning model weights or any post-processing of images. Using the original input to generate a left image and estimate a disparity map for it, we generate the latent vector for the right image through Stereo Pixel Shift operations, complemented by Symmetric Pixel Shift Masking Denoise and Self-Attention Layers Modification methods to align the right-side image with the left-side image. Moreover, our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
- [889] arXiv:2403.04977 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Node Centrality Approximation For Large Networks Based On Inductive Graph Neural NetworksSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI)
Abstract: Closeness Centrality (CC) and Betweenness Centrality (BC) are crucial metrics in network analysis, providing essential reference for discerning the significance of nodes within complex networks. These measures find wide applications in critical tasks, such as community detection and network dismantling. However, their practical implementation on extensive networks remains computationally demanding due to their high time complexity. To mitigate these computational challenges, numerous approximation algorithms have been developed to expedite the computation of CC and BC. Nevertheless, even these approximations still necessitate substantial processing time when applied to large-scale networks. Furthermore, their output proves sensitive to even minor perturbations within the network structure.
In this work, We redefine the CC and BC node ranking problem as a machine learning problem and propose the CNCA-IGE model, which is an encoder-decoder model based on inductive graph neural networks designed to rank nodes based on specified CC or BC metrics. We incorporate the MLP-Mixer model as the decoder in the BC ranking prediction task to enhance the model's robustness and capacity. Our approach is evaluated on diverse synthetic and real-world networks of varying scales, and the experimental results demonstrate that the CNCA-IGE model outperforms state-of-the-art baseline models, significantly reducing execution time while improving performance. - [890] arXiv:2403.05004 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can't Remember Details in Long Documents? You Need Some R&RComments: 13 pages, 1 figure, 9 tables. For associated code repository see this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Long-context large language models (LLMs) hold promise for tasks such as question-answering (QA) over long documents, but they tend to miss important information in the middle of context documents ( arXiv:2307.03172v3 ). Here, we introduce $\textit{R&R}$ -- a combination of two novel prompt-based methods called $\textit{reprompting}$ and $\textit{in-context retrieval}$ (ICR) -- to alleviate this effect in document-based QA. In reprompting, we repeat the prompt instructions periodically throughout the context document to remind the LLM of its original task. In ICR, rather than instructing the LLM to answer the question directly, we instruct it to retrieve the top $k$ passage numbers most relevant to the given question, which are then used as an abbreviated context in a second QA prompt. We test R&R with GPT-4 Turbo and Claude-2.1 on documents up to 80k tokens in length and observe a 16-point boost in QA accuracy on average. Our further analysis suggests that R&R improves performance on long document-based QA because it reduces the distance between relevant context and the instructions. Finally, we show that compared to short-context chunkwise methods, R&R enables the use of larger chunks that cost fewer LLM calls and output tokens, while minimizing the drop in accuracy.
- [891] arXiv:2403.05006 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Provable Multi-Party Reinforcement Learning with Diverse Human FeedbackSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Abstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work \textit{initiates} the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.
- [892] arXiv:2403.05010 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: RFWave: Multi-band Rectified Flow for Audio Waveform ReconstructionSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Recent advancements in generative modeling have led to significant progress in audio waveform reconstruction from diverse representations. Although diffusion models have been used for reconstructing audio waveforms, they tend to exhibit latency issues because they operate at the level of individual sample points and require a relatively large number of sampling steps. In this study, we introduce RFWave, a novel multi-band Rectified Flow approach that reconstructs high-fidelity audio waveforms from Mel-spectrograms. RFWave is distinctive for generating complex spectrograms and operating at the frame level, processing all subbands concurrently to enhance efficiency. Thanks to Rectified Flow, which aims for a flat transport trajectory, RFWave requires only 10 sampling steps. Empirical evaluations demonstrate that RFWave achieves exceptional reconstruction quality and superior computational efficiency, capable of generating audio at a speed 90 times faster than real-time.
- [893] arXiv:2403.05014 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Simple Multigraph Convolution NetworksComments: Accepted by WWW 2024 ShortSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Existing multigraph convolution methods either ignore the cross-view interaction among multiple graphs, or induce extremely high computational cost due to standard cross-view polynomial operators. To alleviate this problem, this paper proposes a Simple MultiGraph Convolution Networks (SMGCN) which first extracts consistent cross-view topology from multigraphs including edge-level and subgraph-level topology, then performs polynomial expansion based on raw multigraphs and consistent topologies. In theory, SMGCN utilizes the consistent topologies in polynomial expansion rather than standard cross-view polynomial expansion, which performs credible cross-view spatial message-passing, follows the spectral convolution paradigm, and effectively reduces the complexity of standard polynomial expansion. In the simulations, experimental results demonstrate that SMGCN achieves state-of-the-art performance on ACM and DBLP multigraph benchmark datasets. Our codes are available at this https URL .
- [894] arXiv:2403.05020 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advances in large language models (LLM) have enabled richer social simulations, allowing for the study of various social phenomena. However, most recent work has used a more omniscient perspective on these simulations (e.g., single LLM to generate all interlocutors), which is fundamentally at odds with the non-omniscient, information asymmetric interactions that involve humans and AI agents in the real world. To examine these differences, we develop an evaluation framework to simulate social interactions with LLMs in various settings (omniscient, non-omniscient). Our experiments show that LLMs perform better in unrealistic, omniscient simulation settings but struggle in ones that more accurately reflect real-world conditions with information asymmetry. Our findings indicate that addressing information asymmetry remains a fundamental challenge for LLM-based agents.
- [895] arXiv:2403.05026 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Spectral Invariant Learning for Dynamic Graphs under Distribution ShiftsComments: NeurIPS'23Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Dynamic graph neural networks (DyGNNs) currently struggle with handling distribution shifts that are inherent in dynamic graphs. Existing work on DyGNNs with out-of-distribution settings only focuses on the time domain, failing to handle cases involving distribution shifts in the spectral domain. In this paper, we discover that there exist cases with distribution shifts unobservable in the time domain while observable in the spectral domain, and propose to study distribution shifts on dynamic graphs in the spectral domain for the first time. However, this investigation poses two key challenges: i) it is non-trivial to capture different graph patterns that are driven by various frequency components entangled in the spectral domain; and ii) it remains unclear how to handle distribution shifts with the discovered spectral patterns. To address these challenges, we propose Spectral Invariant Learning for Dynamic Graphs under Distribution Shifts (SILD), which can handle distribution shifts on dynamic graphs by capturing and utilizing invariant and variant spectral patterns. Specifically, we first design a DyGNN with Fourier transform to obtain the ego-graph trajectory spectrums, allowing the mixed dynamic graph patterns to be transformed into separate frequency components. We then develop a disentangled spectrum mask to filter graph dynamics from various frequency components and discover the invariant and variant spectral patterns. Finally, we propose invariant spectral filtering, which encourages the model to rely on invariant patterns for generalization under distribution shifts. Experimental results on synthetic and real-world dynamic graph datasets demonstrate the superiority of our method for both node classification and link prediction tasks under distribution shifts.
- [896] arXiv:2403.05030 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Defending Against Unforeseen Failure Modes with Latent Adversarial TrainingSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness, however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without generating inputs that elicit them. LAT leverages the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. We use it to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.
- [897] arXiv:2403.05033 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Quantifying Manifolds: Do the manifolds learned by Generative Adversarial Networks converge to the real data manifoldComments: arXiv admin note: text overlap with arXiv:2311.13102Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents our experiments to quantify the manifolds learned by ML models (in our experiment, we use a GAN model) as they train. We compare the manifolds learned at each epoch to the real manifolds representing the real data. To quantify a manifold, we study the intrinsic dimensions and topological features of the manifold learned by the ML model, how these metrics change as we continue to train the model, and whether these metrics convergence over the course of training to the metrics of the real data manifold.
- [898] arXiv:2403.05045 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Are Human Conversations Special? A Large Language Model PerspectiveSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study analyzes changes in the attention mechanisms of large language models (LLMs) when used to understand natural conversations between humans (human-human). We analyze three use cases of LLMs: interactions over web content, code, and mathematical texts. By analyzing attention distance, dispersion, and interdependency across these domains, we highlight the unique challenges posed by conversational data. Notably, conversations require nuanced handling of long-term contextual relationships and exhibit higher complexity through their attention patterns. Our findings reveal that while language models exhibit domain-specific attention behaviors, there is a significant gap in their ability to specialize in human conversations. Through detailed attention entropy analysis and t-SNE visualizations, we demonstrate the need for models trained with a diverse array of high-quality conversational data to enhance understanding and generation of human-like dialogue. This research highlights the importance of domain specialization in language models and suggests pathways for future advancement in modeling human conversational nuances.
- [899] arXiv:2403.05050 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DyRoNet: Dynamic Routing and Low-Rank Adapters for Autonomous Driving Streaming PerceptionComments: Project: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: The advancement of autonomous driving systems hinges on the ability to achieve low-latency and high-accuracy perception. To address this critical need, this paper introduces Dynamic Routering Network (DyRoNet), a low-rank enhanced dynamic routing framework designed for streaming perception in autonomous driving systems. DyRoNet integrates a suite of pre-trained branch networks, each meticulously fine-tuned to function under distinct environmental conditions. At its core, the framework offers a speed router module, developed to assess and route input data to the most suitable branch for processing. This approach not only addresses the inherent limitations of conventional models in adapting to diverse driving conditions but also ensures the balance between performance and efficiency. Extensive experimental evaluations demonstrating the adaptability of DyRoNet to diverse branch selection strategies, resulting in significant performance enhancements across different scenarios. This work not only establishes a new benchmark for streaming perception but also provides valuable engineering insights for future work.
- [900] arXiv:2403.05053 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention SteeringSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image composition involves seamlessly integrating given objects into a specific visual context. The current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion in synthesis and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only slows down inference but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related words to desired regions, addressing the unwanted artifacts shown in the prior method thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.
- [901] arXiv:2403.05063 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Aligning Large Language Models for Controllable RecommendationsComments: 13 pagesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Inspired by the exceptional general intelligence of Large Language Models (LLMs), researchers have begun to explore their application in pioneering the next generation of recommender systems - systems that are conversational, explainable, and controllable. However, existing literature primarily concentrates on integrating domain-specific knowledge into LLMs to enhance accuracy, often neglecting the ability to follow instructions. To address this gap, we initially introduce a collection of supervised learning tasks, augmented with labels derived from a conventional recommender model, aimed at explicitly improving LLMs' proficiency in adhering to recommendation-specific instructions. Subsequently, we develop a reinforcement learning-based alignment procedure to further strengthen LLMs' aptitude in responding to users' intentions and mitigating formatting errors. Through extensive experiments on two real-world datasets, our method markedly advances the capability of LLMs to comply with instructions within recommender systems, while sustaining a high level of accuracy performance.
- [902] arXiv:2403.05064 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unsupervised Graph Neural Architecture Search with Disentangled Self-supervisionComments: NeurIPS'23Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The existing graph neural architecture search (GNAS) methods heavily rely on supervised labels during the search process, failing to handle ubiquitous scenarios where supervisions are not available. In this paper, we study the problem of unsupervised graph neural architecture search, which remains unexplored in the literature. The key problem is to discover the latent graph factors that drive the formation of graph data as well as the underlying relations between the factors and the optimal neural architectures. Handling this problem is challenging given that the latent graph factors together with architectures are highly entangled due to the nature of the graph and the complexity of the neural architecture search process. To address the challenge, we propose a novel Disentangled Self-supervised Graph Neural Architecture Search (DSGAS) model, which is able to discover the optimal architectures capturing various latent graph factors in a self-supervised fashion based on unlabeled graph data. Specifically, we first design a disentangled graph super-network capable of incorporating multiple architectures with factor-wise disentanglement, which are optimized simultaneously. Then, we estimate the performance of architectures under different factors by our proposed self-supervised training with joint architecture-graph disentanglement. Finally, we propose a contrastive search with architecture augmentations to discover architectures with factor-specific expertise. Extensive experiments on 11 real-world datasets demonstrate that the proposed model is able to achieve state-of-the-art performance against several baseline methods in an unsupervised manner.
- [903] arXiv:2403.05066 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Reset & Distill: A Recipe for Overcoming Negative Transfer in Continual Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We argue that one of the main obstacles for developing effective Continual Reinforcement Learning (CRL) algorithms is the negative transfer issue occurring when the new task to learn arrives. Through comprehensive experimental validation, we demonstrate that such issue frequently exists in CRL and cannot be effectively addressed by several recent work on mitigating plasticity loss of RL agents. To that end, we develop Reset & Distill (R&D), a simple yet highly effective method, to overcome the negative transfer problem in CRL. R&D combines a strategy of resetting the agent's online actor and critic networks to learn a new task and an offline learning step for distilling the knowledge from the online actor and previous expert's action probabilities. We carried out extensive experiments on long sequence of Meta-World tasks and show that our method consistently outperforms recent baselines, achieving significantly higher success rates across a range of tasks. Our findings highlight the importance of considering negative transfer in CRL and emphasize the need for robust strategies like R&D to mitigate its detrimental effects.
- [904] arXiv:2403.05100 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial HypervolumeSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: The escalating threat of adversarial attacks on deep learning models, particularly in security-critical fields, has underscored the need for robust deep learning systems. Conventional robustness evaluations have relied on adversarial accuracy, which measures a model's performance under a specific perturbation intensity. However, this singular metric does not fully encapsulate the overall resilience of a model against varying degrees of perturbation. To address this gap, we propose a new metric termed adversarial hypervolume, assessing the robustness of deep learning models comprehensively over a range of perturbation intensities from a multi-objective optimization standpoint. This metric allows for an in-depth comparison of defense mechanisms and recognizes the trivial improvements in robustness afforded by less potent defensive strategies. Additionally, we adopt a novel training algorithm that enhances adversarial robustness uniformly across various perturbation intensities, in contrast to methods narrowly focused on optimizing adversarial accuracy. Our extensive empirical studies validate the effectiveness of the adversarial hypervolume metric, demonstrating its ability to reveal subtle differences in robustness that adversarial accuracy overlooks. This research contributes a new measure of robustness and establishes a standard for assessing and benchmarking the resilience of current and future defensive models against adversarial threats.
- [905] arXiv:2403.05101 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rule-driven News CaptioningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: News captioning task aims to generate sentences by describing named entities or concrete events for an image with its news article. Existing methods have achieved remarkable results by relying on the large-scale pre-trained models, which primarily focus on the correlations between the input news content and the output predictions. However, the news captioning requires adhering to some fundamental rules of news reporting, such as accurately describing the individuals and actions associated with the event. In this paper, we propose the rule-driven news captioning method, which can generate image descriptions following designated rule signal. Specifically, we first design the news-aware semantic rule for the descriptions. This rule incorporates the primary action depicted in the image (e.g., "performing") and the roles played by named entities involved in the action (e.g., "Agent" and "Place"). Second, we inject this semantic rule into the large-scale pre-trained model, BART, with the prefix-tuning strategy, where multiple encoder layers are embedded with news-aware semantic rule. Finally, we can effectively guide BART to generate news sentences that comply with the designated rule. Extensive experiments on two widely used datasets (i.e., GoodNews and NYTimes800k) demonstrate the effectiveness of our method.
- [906] arXiv:2403.05104 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: How Culture Shapes What People Want From AIComments: To appear at CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: There is an urgent need to incorporate the perspectives of culturally diverse groups into AI developments. We present a novel conceptual framework for research that aims to expand, reimagine, and reground mainstream visions of AI using independent and interdependent cultural models of the self and the environment. Two survey studies support this framework and provide preliminary evidence that people apply their cultural models when imagining their ideal AI. Compared with European American respondents, Chinese respondents viewed it as less important to control AI and more important to connect with AI, and were more likely to prefer AI with capacities to influence. Reflecting both cultural models, findings from African American respondents resembled both European American and Chinese respondents. We discuss study limitations and future directions and highlight the need to develop culturally responsive and relevant AI to serve a broader segment of the world population.
- [907] arXiv:2403.05105 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning to Rematch Mismatched Pairs for Robust Cross-Modal RetrievalComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval models. However, in real-world scenarios, massive multimodal data are harvested from the Internet, which inevitably contains Partially Mismatched Pairs (PMPs). Undoubtedly, such semantical irrelevant data will remarkably harm the cross-modal retrieval performance. Previous efforts tend to mitigate this problem by estimating a soft correspondence to down-weight the contribution of PMPs. In this paper, we aim to address this challenge from a new perspective: the potential semantic similarity among unpaired samples makes it possible to excavate useful knowledge from mismatched pairs. To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs. In detail, L2RM aims to generate refined alignments by seeking a minimal-cost transport plan across different modalities. To formalize the rematching idea in OT, first, we propose a self-supervised cost function that automatically learns from explicit similarity-cost mapping relation. Second, we present to model a partial OT problem while restricting the transport among false positives to further boost refined alignments. Extensive experiments on three benchmarks demonstrate our L2RM significantly improves the robustness against PMPs for existing models. The code is available at this https URL .
- [908] arXiv:2403.05108 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: A Task-Driven Multi-UAV Coalition Formation MechanismSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: With the rapid advancement of UAV technology, the problem of UAV coalition formation has become a hotspot. Therefore, designing task-driven multi-UAV coalition formation mechanism has become a challenging problem. However, existing coalition formation mechanisms suffer from low relevance between UAVs and task requirements, resulting in overall low coalition utility and unstable coalition structures. To address these problems, this paper proposed a novel multi-UAV coalition network collaborative task completion model, considering both coalition work capacity and task-requirement relationships. This model stimulated the formation of coalitions that match task requirements by using a revenue function based on the coalition's revenue threshold. Subsequently, an algorithm for coalition formation based on marginal utility was proposed. Specifically, the algorithm utilized Shapley value to achieve fair utility distribution within the coalition, evaluated coalition values based on marginal utility preference order, and achieved stable coalition partition through a limited number of iterations. Additionally, we theoretically proved that this algorithm has Nash equilibrium solution. Finally, experimental results demonstrated that the proposed algorithm, compared to currently classical algorithms, not only forms more stable coalitions but also further enhances the overall utility of coalitions effectively.
- [909] arXiv:2403.05110 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Efficient Data Collection for Robotic Manipulation via Compositional GeneralizationComments: 17 pagesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Data collection has become an increasingly important problem in robotic manipulation, yet there still lacks much understanding of how to effectively collect data to facilitate broad generalization. Recent works on large-scale robotic data collection typically vary a wide range of environmental factors during data collection, such as object types and table textures. While these works attempt to cover a diverse variety of scenarios, they do not explicitly account for the possible compositional abilities of policies trained on the data. If robot policies are able to compose different environmental factors of variation (e.g., object types, table heights) from their training data to succeed when encountering unseen factor combinations, then we can exploit this to avoid collecting data for situations that composition would address. To investigate this possibility, we conduct thorough empirical studies both in simulation and on a real robot that compare data collection strategies and assess whether visual imitation learning policies can compose environmental factors. We find that policies do exhibit composition, although leveraging prior robotic datasets is critical for this on a real robot. We use these insights to provide better practices for in-domain data collection by proposing data collection strategies that exploit composition, which can induce better generalization than naive approaches for the same amount of effort during data collection. We further demonstrate that a real robot policy trained on data from such a strategy achieves a success rate of 77.5% when transferred to entirely new environments that encompass unseen combinations of environmental factors, whereas policies trained using data collected without accounting for environmental variation fail to transfer effectively, with a success rate of only 2.5%. We provide videos at this http URL .
- [910] arXiv:2403.05125 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image SynthesisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models, applied to human image synthesis. Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness. We introduce an innovative aesthetic score prediction model that assesses the visual appeal of generated images and unveils the first dataset marked with low-quality regions in generated human images to facilitate automatic defect detection. Our exploration into concept coverage probes the model's effectiveness in interpreting and rendering text-based concepts accurately, while our analysis of fairness reveals biases in model outputs, with an emphasis on gender, race, and age. While our study is grounded in human imagery, this dual-faceted approach is designed with the flexibility to be applicable to other forms of image generation, enhancing our understanding of generative models and paving the way to the next generation of more sophisticated, contextually aware, and ethically attuned generative models. We will release our code, the data used for evaluating generative models and the dataset annotated with defective areas soon.
- [911] arXiv:2403.05129 (cross-list from cond-mat.soft) [ pdf , ps , html , other ]
-
Title: Unraveling the Molecular Magic: AI Insights on the Formation of Extraordinarily Stretchable HydrogelsShahriar Hojjati Emmami , Ali Pilehvar Meibody , Lobat Tayebi , Mohammadamin Tavakoli , Pierre BaldiSubjects: Soft Condensed Matter (cond-mat.soft) ; Artificial Intelligence (cs.AI)
Abstract: The deliberate manipulation of ammonium persulfate, methylenebisacrylamide, dimethyleacrylamide, and polyethylene oxide concentrations resulted in the development of a hydrogel with an exceptional stretchability, capable of extending up to 260 times its original length. This study aims to elucidate the molecular architecture underlying this unique phenomenon by exploring potential reaction mechanisms, facilitated by an artificial intelligence prediction system. Artificial intelligence predictor introduces a novel approach to interlinking two polymers, involving the formation of networks interconnected with linear chains following random chain scission. This novel configuration leads to the emergence of a distinct type of hydrogel, herein referred to as a "Span Network." Additionally, Fourier-transform infrared spectroscopy (FTIR) is used to investigate functional groups that may be implicated in the proposed mechanism, with ester formation confirmed among numerous hydroxyl end groups obtained from chain scission of PEO and carboxyl groups formed on hydrogel networks.
- [912] arXiv:2403.05132 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ChatUIE: Exploring Chat-based Unified Information Extraction using Large Language ModelsComments: Accepted by LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in large language models have shown impressive performance in general chat. However, their domain-specific capabilities, particularly in information extraction, have certain limitations. Extracting structured information from natural language that deviates from known schemas or instructions has proven challenging for previous prompt-based methods. This motivated us to explore domain-specific modeling in chat-based language models as a solution for extracting structured information from natural language. In this paper, we present ChatUIE, an innovative unified information extraction framework built upon ChatGLM. Simultaneously, reinforcement learning is employed to improve and align various tasks that involve confusing and limited samples. Furthermore, we integrate generation constraints to address the issue of generating elements that are not present in the input. Our experimental results demonstrate that ChatUIE can significantly improve the performance of information extraction with a slight decrease in chatting ability.
- [913] arXiv:2403.05149 (cross-list from physics.app-ph) [ pdf , ps , html , other ]
-
Title: Inverse Design of Photonic Crystal Surface Emitting Lasers is a Sequence Modeling ProblemComments: accepted by AAAI workshop AI2ASE(2024) this https URLSubjects: Applied Physics (physics.app-ph) ; Artificial Intelligence (cs.AI)
Abstract: Photonic Crystal Surface Emitting Lasers (PCSEL)'s inverse design demands expert knowledge in physics, materials science, and quantum mechanics which is prohibitively labor-intensive. Advanced AI technologies, especially reinforcement learning (RL), have emerged as a powerful tool to augment and accelerate this inverse design process. By modeling the inverse design of PCSEL as a sequential decision-making problem, RL approaches can construct a satisfactory PCSEL structure from scratch. However, the data inefficiency resulting from online interactions with precise and expensive simulation environments impedes the broader applicability of RL approaches. Recently, sequential models, especially the Transformer architecture, have exhibited compelling performance in sequential decision-making problems due to their simplicity and scalability to large language models. In this paper, we introduce a novel framework named PCSEL Inverse Design Transformer (PiT) that abstracts the inverse design of PCSEL as a sequence modeling problem. The central part of our PiT is a Transformer-based structure that leverages the past trajectories and current states to predict the current actions. Compared with the traditional RL approaches, PiT can output the optimal actions and achieve target PCSEL designs by leveraging offline data and conditioning on the desired return. Results demonstrate that PiT achieves superior performance and data efficiency compared to baselines.
- [914] arXiv:2403.05152 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Towards a Psychology of Machines: Large Language Models Predict Human MemoryComments: 32 pages, 3 figures, 2 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) are demonstrating remarkable capabilities across various tasks despite lacking a foundation in human cognition. This raises the question: can these models, beyond simply mimicking human language patterns, offer insights into the mechanisms underlying human cognition? This study explores the ability of ChatGPT to predict human performance in a language-based memory task. Building upon theories of text comprehension, we hypothesize that recognizing ambiguous sentences (e.g., "Because Bill drinks wine is never kept in the house") is facilitated by preceding them with contextually relevant information. Participants, both human and ChatGPT, were presented with pairs of sentences. The second sentence was always a garden-path sentence designed to be inherently ambiguous, while the first sentence either provided a fitting (e.g., "Bill has chronic alcoholism") or an unfitting context (e.g., "Bill likes to play golf"). We measured both human's and ChatGPT's ratings of sentence relatedness, ChatGPT's memorability ratings for the garden-path sentences, and humans' spontaneous memory for the garden-path sentences. The results revealed a striking alignment between ChatGPT's assessments and human performance. Sentences deemed more related and assessed as being more memorable by ChatGPT were indeed better remembered by humans, even though ChatGPT's internal mechanisms likely differ significantly from human cognition. This finding, which was confirmed with a robustness check employing synonyms, underscores the potential of generative AI models to predict human performance accurately. We discuss the broader implications of these findings for leveraging LLMs in the development of psychological theories and for gaining a deeper understanding of human cognition.
- [915] arXiv:2403.05158 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Adaptive Split Learning over Energy-Constrained Wireless Edge NetworksComments: 6 pages, 5 figures, 20 conferencesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract: Split learning (SL) is a promising approach for training artificial intelligence (AI) models, in which devices collaborate with a server to train an AI model in a distributed manner, based on a same fixed split point. However, due to the device heterogeneity and variation of channel conditions, this way is not optimal in training delay and energy consumption. In this paper, we design an adaptive split learning (ASL) scheme which can dynamically select split points for devices and allocate computing resource for the server in wireless edge networks. We formulate an optimization problem to minimize the average training latency subject to long-term energy consumption constraint. The difficulties in solving this problem are the lack of future information and mixed integer programming (MIP). To solve it, we propose an online algorithm leveraging the Lyapunov theory, named OPEN, which decomposes it into a new MIP problem only with the current information. Then, a two-layer optimization method is proposed to solve the MIP problem. Extensive simulation results demonstrate that the ASL scheme can reduce the average training delay and energy consumption by 53.7% and 22.1%, respectively, as compared to the existing SL schemes.
- [916] arXiv:2403.05164 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Synthetic data generation for system identification: leveraging knowledge transfer from similar systemsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: This paper addresses the challenge of overfitting in the learning of dynamical systems by introducing a novel approach for the generation of synthetic data, aimed at enhancing model generalization and robustness in scenarios characterized by data scarcity. Central to the proposed methodology is the concept of knowledge transfer from systems within the same class. Specifically, synthetic data is generated through a pre-trained meta-model that describes a broad class of systems to which the system of interest is assumed to belong. Training data serves a dual purpose: firstly, as input to the pre-trained meta model to discern the system's dynamics, enabling the prediction of its behavior and thereby generating synthetic output sequences for new input sequences; secondly, in conjunction with synthetic data, to define the loss function used for model estimation. A validation dataset is used to tune a scalar hyper-parameter balancing the relative importance of training and synthetic data in the definition of the loss function. The same validation set can be also used for other purposes, such as early stopping during the training, fundamental to avoid overfitting in case of small-size training datasets. The efficacy of the approach is shown through a numerical example that highlights the advantages of integrating synthetic data into the system identification process.
- [917] arXiv:2403.05168 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Unlocking the Potential of Multimodal Unified Discrete Representation through Training-Free Codebook Optimization and Hierarchical AlignmentHai Huang , Yan Xia , Shengpeng Ji , Shulei Wang , Hanting Wang , Jieming Zhu , Zhenhua Dong , Zhou ZhaoSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advances in representation learning have demonstrated the significance of multimodal alignment. The Dual Cross-modal Information Disentanglement (DCID) model, utilizing a unified codebook, shows promising results in achieving fine-grained representation and cross-modal generalization. However, it is still hindered by equal treatment of all channels and neglect of minor event information, resulting in interference from irrelevant channels and limited performance in fine-grained tasks. Thus, in this work, We propose a Training-free Optimization of Codebook (TOC) method to enhance model performance by selecting important channels in the unified space without retraining. Additionally, we introduce the Hierarchical Dual Cross-modal Information Disentanglement (H-DCID) approach to extend information separation and alignment to two levels, capturing more cross-modal details. The experiment results demonstrate significant improvements across various downstream tasks, with TOC contributing to an average improvement of 1.70% for DCID on four tasks, and H-DCID surpassing DCID by an average of 3.64%. The combination of TOC and H-DCID further enhances performance, exceeding DCID by 4.43%. These findings highlight the effectiveness of our methods in facilitating robust and nuanced cross-modal learning, opening avenues for future enhancements. The source code and pre-trained models can be accessed at this https URL .
- [918] arXiv:2403.05171 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty EstimationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.
- [919] arXiv:2403.05175 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Continual Learning and Catastrophic ForgettingComments: Preprint of a book chapter; 21 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Abstract: This book chapter delves into the dynamics of continual learning, which is the process of incrementally learning from a non-stationary stream of data. Although continual learning is a natural skill for the human brain, it is very challenging for artificial neural networks. An important reason is that, when learning something new, these networks tend to quickly and drastically forget what they had learned before, a phenomenon known as catastrophic forgetting. Especially in the last decade, continual learning has become an extensively studied topic in deep learning. This book chapter reviews the insights that this field has generated.
- [920] arXiv:2403.05189 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Tracing the Roots of Facts in Multilingual Language Models: Independent, Shared, and Transferred KnowledgeComments: EACL 2024 main conferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Acquiring factual knowledge for language models (LMs) in low-resource languages poses a serious challenge, thus resorting to cross-lingual transfer in multilingual LMs (ML-LMs). In this study, we ask how ML-LMs acquire and represent factual knowledge. Using the multilingual factual knowledge probing dataset, mLAMA, we first conducted a neuron investigation of ML-LMs (specifically, multilingual BERT). We then traced the roots of facts back to the knowledge source (Wikipedia) to identify the ways in which ML-LMs acquire specific facts. We finally identified three patterns of acquiring and representing facts in ML-LMs: language-independent, cross-lingual shared and transferred, and devised methods for differentiating them. Our findings highlight the challenge of maintaining consistent factual knowledge across languages, underscoring the need for better fact representation learning in ML-LMs.
- [921] arXiv:2403.05209 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Overcoming Data Inequality across Domains with Semi-Supervised Domain GeneralizationComments: 20 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: While there have been considerable advancements in machine learning driven by extensive datasets, a significant disparity still persists in the availability of data across various sources and populations. This inequality across domains poses challenges in modeling for those with limited data, which can lead to profound practical and ethical concerns. In this paper, we address a representative case of data inequality problem across domains termed Semi-Supervised Domain Generalization (SSDG), in which only one domain is labeled while the rest are unlabeled. We propose a novel algorithm, ProUD, which can effectively learn domain-invariant features via domain-aware prototypes along with progressive generalization via uncertainty-adaptive mixing of labeled and unlabeled domains. Our experiments on three different benchmark datasets demonstrate the effectiveness of ProUD, outperforming all baseline models including single domain generalization and semi-supervised learning. Source code will be released upon acceptance of the paper.
- [922] arXiv:2403.05217 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question AnsweringComments: TheWebConf 2024 (WWW 2024) oral, code repo: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Open-domain question answering (ODQA) has emerged as a pivotal research spotlight in information systems. Existing methods follow two main paradigms to collect evidence: (1) The \textit{retrieve-then-read} paradigm retrieves pertinent documents from an external corpus; and (2) the \textit{generate-then-read} paradigm employs large language models (LLMs) to generate relevant documents. However, neither can fully address multifaceted requirements for evidence. To this end, we propose LLMQA, a generalized framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation, combining the superiority of both retrieval-based and generation-based evidence. Since LLMs exhibit their excellent capabilities to accomplish various tasks, we instruct LLMs to play multiple roles as generators, rerankers, and evaluators within our framework, integrating them to collaborate in the ODQA process. Furthermore, we introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers. Extensive experimental results on widely used benchmarks (NQ, WebQ, and TriviaQA) demonstrate that LLMQA achieves the best performance in terms of both answer accuracy and evidence quality, showcasing its potential for advancing ODQA research and applications.
- [923] arXiv:2403.05220 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Synthetic Privileged Information Enhances Medical Image Representation LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
Abstract: Multimodal self-supervised representation learning has consistently proven to be a highly effective method in medical image analysis, offering strong task performance and producing biologically informed insights. However, these methods heavily rely on large, paired datasets, which is prohibitive for their use in scenarios where paired data does not exist, or there is only a small amount available. In contrast, image generation methods can work well on very small datasets, and can find mappings between unpaired datasets, meaning an effectively unlimited amount of paired synthetic data can be generated. In this work, we demonstrate that representation learning can be significantly improved by synthetically generating paired information, both compared to training on either single-modality (up to 4.4x error reduction) or authentic multi-modal paired datasets (up to 5.6x error reduction).
- [924] arXiv:2403.05235 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Fairness-Aware Interpretable Modeling (FAIM) for Trustworthy Machine Learning in HealthcareMingxuan Liu , Yilin Ning , Yuhe Ke , Yuqing Shang , Bibhas Chakraborty , Marcus Eng Hock Ong , Roger Vaughan , Nan LiuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The escalating integration of machine learning in high-stakes fields such as healthcare raises substantial concerns about model fairness. We propose an interpretable framework - Fairness-Aware Interpretable Modeling (FAIM), to improve model fairness without compromising performance, featuring an interactive interface to identify a "fairer" model from a set of high-performing models and promoting the integration of data-driven evidence and clinical expertise to enhance contextualized fairness. We demonstrated FAIM's value in reducing sex and race biases by predicting hospital admission with two real-world databases, MIMIC-IV-ED and SGH-ED. We show that for both datasets, FAIM models not only exhibited satisfactory discriminatory performance but also significantly mitigated biases as measured by well-established fairness metrics, outperforming commonly used bias-mitigation methods. Our approach demonstrates the feasibility of improving fairness without sacrificing performance and provides an a modeling mode that invites domain experts to engage, fostering a multidisciplinary effort toward tailored AI fairness.
- [925] arXiv:2403.05239 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image GenerationJunyan Wang , Zhenhong Sun , Zhiyu Tan , Xuanbai Chen , Weihua Chen , Hao Li , Cheng Zhang , Yang SongComments: Accepted to CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Vanilla text-to-image diffusion models struggle with generating accurate human images, commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs.Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls -- human-centric priors such as pose or depth maps -- during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, according to an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models to synthesize high-quality human images based on user-written prompts. Project page: \url{ this https URL }.
- [926] arXiv:2403.05245 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Noise Level Adaptive Diffusion Model for Robust Reconstruction of Accelerated MRIShoujin Huang , Guanxiong Luo , Xi Wang , Ziran Chen , Yuwan Wang , Huaishui Yang , Pheng-Ann Heng , Lingyan Zhang , Mengye LyuSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In general, diffusion model-based MRI reconstruction methods incrementally remove artificially added noise while imposing data consistency to reconstruct the underlying images. However, real-world MRI acquisitions already contain inherent noise due to thermal fluctuations. This phenomenon is particularly notable when using ultra-fast, high-resolution imaging sequences for advanced research, or using low-field systems favored by low- and middle-income countries. These common scenarios can lead to sub-optimal performance or complete failure of existing diffusion model-based reconstruction techniques. Specifically, as the artificially added noise is gradually removed, the inherent MRI noise becomes increasingly pronounced, making the actual noise level inconsistent with the predefined denoising schedule and consequently inaccurate image reconstruction. To tackle this problem, we propose a posterior sampling strategy with a novel NoIse Level Adaptive Data Consistency (Nila-DC) operation. Extensive experiments are conducted on two public datasets and an in-house clinical dataset with field strength ranging from 0.3T to 3T, showing that our method surpasses the state-of-the-art MRI reconstruction methods, and is highly robust against various noise levels. The code will be released after review.
- [927] arXiv:2403.05266 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have achieved unprecedented performance in various applications, yet their evaluation remains a critical issue. Existing hallucination benchmarks are either static or lack adjustable complexity for thorough analysis. We contend that utilizing existing relational databases is a promising approach for constructing benchmarks due to their accurate knowledge description via functional dependencies. We propose ERBench to automatically convert any relational database into a benchmark based on the entity-relationship (ER) model. Our key idea is to construct questions using the database schema, records, and functional dependencies such that they can be automatically verified. In addition, we use foreign key constraints to join relations and construct multihop questions, which can be arbitrarily complex and used to debug the intermediate answers of LLMs. Finally, ERBench supports continuous evaluation, multimodal questions, and various prompt engineering techniques. In our experiments, we construct an LLM benchmark using databases of multiple domains and make an extensive comparison of contemporary LLMs. We observe that better LLMs like GPT-4 can handle a larger variety of question types, but are by no means perfect. Also, correct answers do not necessarily imply correct rationales, which is an important evaluation that ERBench does better than other benchmarks for various question types. Code is available at https: //github.com/DILAB-KAIST/ERBench.
- [928] arXiv:2403.05297 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PEEB: Part-based Image Classifiers with an Explainable and Editable Language BottleneckComments: Findings of NAACL 2024 (long paper)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. Therefore, they perform poorly on new classes or the classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB - an explainable and editable classifier to (1) express the class name into a set of text descriptors that describe the visual parts of that class; and (2) match the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a huge margin (~10x in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) on the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Dogs-120, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings.
- [929] arXiv:2403.05300 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unity by Diversity: Improved Representation Learning in Multimodal VAEsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information from its uncompressed original features better. In extensive experiments on multiple benchmark datasets and a challenging real-world neuroscience data set, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.
- [930] arXiv:2403.05313 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon GenerationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We explore how iterative revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while hugely mitigating hallucination. In particular, the proposed method -- *retrieval-augmented thoughts* (RAT) -- revises each thought step one by one with retrieved information relevant to the task query, the current and the past thought steps, after the initial zero-shot CoT is generated. Applying RAT to GPT-3.5, GPT-4, and CodeLLaMA-7b substantially improves their performances on various long-horizon generation tasks; on average of relatively increasing rating scores by 13.63% on code generation, 16.96% on mathematical reasoning, 19.2% on creative writing, and 42.78% on embodied task planning. The demo page can be found at this https URL
- [931] arXiv:2403.05326 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ChatASU: Evoking LLM's Reflexion to Truly Understand Aspect Sentiment in DialoguesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Aspect Sentiment Understanding (ASU) in interactive scenarios (e.g., Question-Answering and Dialogue) has attracted ever-more interest in recent years and achieved important progresses. However, existing studies on interactive ASU largely ignore the coreference issue for opinion targets (i.e., aspects), while this phenomenon is ubiquitous in interactive scenarios especially dialogues, limiting the ASU performance. Recently, large language models (LLMs) shows the powerful ability to integrate various NLP tasks with the chat paradigm. In this way, this paper proposes a new Chat-based Aspect Sentiment Understanding (ChatASU) task, aiming to explore LLMs' ability in understanding aspect sentiments in dialogue scenarios. Particularly, this ChatASU task introduces a sub-task, i.e., Aspect Chain Reasoning (ACR) task, to address the aspect coreference issue. On this basis, we propose a Trusted Self-reflexion Approach (TSA) with ChatGLM as backbone to ChatASU. Specifically, this TSA treats the ACR task as an auxiliary task to boost the performance of the primary ASU task, and further integrates trusted learning into reflexion mechanisms to alleviate the LLMs-intrinsic factual hallucination problem in TSA. Furthermore, a high-quality ChatASU dataset is annotated to evaluate TSA, and extensive experiments show that our proposed TSA can significantly outperform several state-of-the-art baselines, justifying the effectiveness of TSA to ChatASU and the importance of considering the coreference and hallucination issues in ChatASU.
- [932] arXiv:2403.05334 (cross-list from cs.PL) [ pdf , ps , html , other ]
-
Title: WatChat: Explaining perplexing programs by debugging mental modelsSubjects: Programming Languages (cs.PL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Often, a good explanation for a program's unexpected behavior is a bug in the programmer's code. But sometimes, an even better explanation is a bug in the programmer's mental model of the language they are using. Instead of merely debugging our current code ("giving the programmer a fish"), what if our tools could directly debug our mental models ("teaching the programmer to fish")? In this paper, we apply ideas from computational cognitive science to do exactly that. Given a perplexing program, we use program synthesis techniques to automatically infer potential misconceptions that might cause the user to be surprised by the program's behavior. By analyzing these misconceptions, we provide succinct, useful explanations of the program's behavior. Our methods can even be inverted to synthesize pedagogical example programs for diagnosing and correcting misconceptions in students.
- [933] arXiv:2403.05379 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Self-Supervised Multiple Instance Learning for Acute Myeloid Leukemia ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Automated disease diagnosis using medical image analysis relies on deep learning, often requiring large labeled datasets for supervised model training. Diseases like Acute Myeloid Leukemia (AML) pose challenges due to scarce and costly annotations on a single-cell level. Multiple Instance Learning (MIL) addresses weakly labeled scenarios but necessitates powerful encoders typically trained with labeled data. In this study, we explore Self-Supervised Learning (SSL) as a pre-training approach for MIL-based AML subtype classification from blood smears, removing the need for labeled data during encoder training. We investigate the three state-of-the-art SSL methods SimCLR, SwAV, and DINO, and compare their performance against supervised pre-training. Our findings show that SSL-pretrained encoders achieve comparable performance, showcasing the potential of SSL in MIL. This breakthrough offers a cost-effective and data-efficient solution, propelling the field of AI-based disease diagnosis.
- [934] arXiv:2403.05396 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context InteractionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports from local and global granularity. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset, model weights, and source code are available in this https URL .
- [935] arXiv:2403.05406 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Considering Nonstationary within Multivariate Time Series with Variational Hierarchical Transformer for ForecastingComments: accepted by AAAI2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The forecasting of Multivariate Time Series (MTS) has long been an important but challenging task. Due to the non-stationary problem across long-distance time steps, previous studies primarily adopt stationarization method to attenuate the non-stationary problem of the original series for better predictability. However, existing methods always adopt the stationarized series, which ignores the inherent non-stationarity, and has difficulty in modeling MTS with complex distributions due to the lack of stochasticity. To tackle these problems, we first develop a powerful hierarchical probabilistic generative module to consider the non-stationarity and stochastic characteristics within MTS, and then combine it with transformer for a well-defined variational generative dynamic model named Hierarchical Time series Variational Transformer (HTV-Trans), which recovers the intrinsic non-stationary information into temporal dependencies. Being a powerful probabilistic model, HTV-Trans is utilized to learn expressive representations of MTS and applied to forecasting tasks. Extensive experiments on diverse datasets show the efficiency of HTV-Trans on MTS forecasting tasks
- [936] arXiv:2403.05465 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN InferenceComments: 2024 61st IEEE/ACM Design Automation Conference (DAC)Subjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training. In this study, we introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits that dynamically adapts to DNN weight/activation distributions by parameterizing LP bit fields. We also develop a novel genetic-algorithm based framework, LP Quantization (LPQ), to find optimal layer-wise LP parameters while reducing representational divergence between quantized and full-precision models through a novel global-local contrastive objective. Additionally, we design a unified mixed-precision LP accelerator (LPA) architecture comprising of processing elements (PEs) incorporating LP in the computational datapath. Our algorithm-hardware co-design demonstrates on average <1% drop in top-1 accuracy across various CNN and ViT models. It also achieves ~ 2x improvements in performance per unit area and 2.2x gains in energy efficiency compared to state-of-the-art quantization accelerators using different data types.
- [937] arXiv:2403.05468 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Will GPT-4 Run DOOM?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom. This large language model (LLM) is able to run and play the game with only a few instructions, plus a textual description--generated by the model itself from screenshots--about the state of the game being observed. We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing. More complex prompting strategies involving multiple model calls provide better results. While further work is required to enable the LLM to play the game as well as its classical, reinforcement learning-based counterparts, we note that GPT-4 required no training, leaning instead on its own reasoning and observational capabilities. We hope our work pushes the boundaries on intelligent, LLM-based agents in video games. We conclude by discussing the ethical implications of our work.
- [938] arXiv:2403.05490 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Poly-View Contrastive LearningComments: Accepted to ICLR 2024. 42 pages, 7 figures, 3 tables, loss pseudo-code included in appendixSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (stat.ML)
Abstract: Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.
- [939] arXiv:2403.05518 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-ThoughtJames Chua , Edward Rees , Hunar Batra , Samuel R. Bowman , Julian Michael , Ethan Perez , Miles TurpinSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models' behavior--for example, rationalizing answers in line with a user's opinion without mentioning this bias. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where supervision for ground truth reasoning is unavailable.
- [940] arXiv:2403.05527 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to majority of entries of similar magnitudes to ultra-low precision. It then employs a low rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating three techniques, GEAR is able to fully exploit their synergistic potentials. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak-memory size up to 2.29x. Our code is publicly available at this https URL .
- [941] arXiv:2403.05530 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of contextGemini Team Google : Machel Reid , Nikolay Savinov , Denis Teplyashin , Dmitry (Dima) Lepikhin , Timothy Lillicrap , Jean-baptiste Alayrac , Radu Soricut , Angeliki Lazaridou , Orhan Firat , Julian Schrittwieser , Ioannis Antonoglou , Rohan Anil , Sebastian Borgeaud , Andrew Dai , Katie Millican , Ethan Dyer , Mia Glaese , Thibault Sottiaux , Benjamin Lee , Fabio Viola , Malcolm Reynolds , Yuanzhong Xu , James Molloy , Jilin Chen , Michael Isard , Paul Barham , Tom Hennigan , Ross McIlroy , Melvin Johnson , Johan Schalkwyk , Eli Collins , Eliza Rutherford , Erica Moreira , Kareem Ayoub , Megha Goel , Clemens Meyer , Gregory Thornton , Zhen Yang , Henryk Michalewski , Zaheer Abbas , Nathan Schucher , Ankesh Anand , Richard Ives , James Keeling , Karel Lenc , Salem Haykal , Siamak Shakeri , Pranav Shyam , Aakanksha Chowdhery , Roman Ring , Stephen Spencer , Eren Sezener , Luke Vilnis , Oscar Chang , Nobuyuki Morioka , George Tucker , Ce Zheng , Oliver Woodman , Nithya Attaluri , Tomas Kocisky , Evgenii Eltyshev , Xi Chen , Timothy Chung , Vittorio Selo , Siddhartha Brahma , Petko Georgiev , Ambrose Slone , Zhenkai Zhu , James Lottes , Siyuan Qiao , Ben Caine , Sebastian Riedel , Alex Tomala , Martin Chadwick , Juliette Love , Peter Choy , Sid Mittal , Neil Houlsby , Yunhao Tang , Matthew Lamm , Libin Bai , Qiao Zhang , Luheng He , Yong Cheng , Peter Humphreys , Yujia Li , Sergey Brin , Albin Cassirer , Yingjie Miao , Lukas Zilka , Taylor Tobin , Kelvin Xu , Lev Proleev , Daniel Sohn , Alberto Magni , Lisa Anne Hendricks , Isabel Gao , Santiago OntanonSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
- [942] arXiv:2403.05535 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and VideosComments: Project Page and Code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We introduce LaGTran, a novel framework that utilizes readily available or easily acquired text descriptions to guide robust transfer of discriminative knowledge from labeled source to unlabeled target data with domain shifts. While unsupervised adaptation methods have been established to address this problem, they show limitations in handling challenging domain shifts due to their exclusive operation within the pixel-space. Motivated by our observation that semantically richer text modality has more favorable transfer properties, we devise a transfer mechanism to use a source-trained text-classifier to generate predictions on the target text descriptions, and utilize these predictions as supervision for the corresponding images. Our approach driven by language guidance is surprisingly easy and simple, yet significantly outperforms all prior approaches on challenging datasets like GeoNet and DomainNet, validating its extreme effectiveness. To further extend the scope of our study beyond images, we introduce a new benchmark to study ego-exo transfer in videos and find that our language-aided LaGTran yields significant gains in this highly challenging and non-trivial transfer setting. Code, models, and proposed datasets are publicly available at this https URL .
- [943] arXiv:2403.05541 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: AI in ESG for Financial Institutions: An Industrial SurveyComments: 31 pages, 14 tables, 3 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Abstract: The burgeoning integration of Artificial Intelligence (AI) into Environmental, Social, and Governance (ESG) initiatives within the financial sector represents a paradigm shift towards more sus-tainable and equitable financial practices. This paper surveys the industrial landscape to delineate the necessity and impact of AI in bolstering ESG frameworks. With the advent of stringent regulatory requirements and heightened stakeholder awareness, financial institutions (FIs) are increasingly compelled to adopt ESG criteria. AI emerges as a pivotal tool in navigating the complex in-terplay of financial activities and sustainability goals. Our survey categorizes AI applications across three main pillars of ESG, illustrating how AI enhances analytical capabilities, risk assessment, customer engagement, reporting accuracy and more. Further, we delve into the critical con-siderations surrounding the use of data and the development of models, underscoring the importance of data quality, privacy, and model robustness. The paper also addresses the imperative of responsible and sustainable AI, emphasizing the ethical dimensions of AI deployment in ESG-related banking processes. Conclusively, our findings suggest that while AI offers transformative potential for ESG in banking, it also poses significant challenges that necessitate careful consideration. The final part of the paper synthesizes the survey's insights, proposing a forward-looking stance on the adoption of AI in ESG practices. We conclude with recommendations with a reference architecture for future research and development, advocating for a balanced approach that leverages AI's strengths while mitigating its risks within the ESG domain.
- [944] arXiv:2403.05544 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: From Algorithm Worship to the Art of Human Learning: Insights from 50-year journey of AI in EducationComments: 12 pages; opinion pieceSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Current discourse surrounding Artificial Intelligence (AI) oscillates between hope and apprehension, painting a future where AI reshapes every facet of human life, including Education. This paper delves into the complexities of AI's role in Education, addressing the mixed messages that have both enthused and alarmed educators, policymakers, and the public. It explores the promises that AI holds for enhancing learning through personalisation at scale, against the backdrop of concerns about ethical implications, the devaluation of non-STEM subjects, and the potential transformative impact on our neurocognitive and socio-emotional functioning. Drawing on recent research and global discourse, the paper seeks to unpack the reasons behind the vagueness of current discussions on AI in Education (AIED) and the implications of this ambiguity for future educational practices and policies. By highlighting insights from educational research and synthesising evidence-based best practices in AIED, the aim is to provide a clearer understanding of how AI technologies can be aligned with the fundamental principles of learning and teaching, and explore what concrete actions may need to be prioritised now to truly enhance learning experiences and outcomes for all in the future.
- [945] arXiv:2403.05547 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: AI for non-programmers: Applied AI in the lectures for students without programming skillsComments: 10 pages, 6 figures, Translated from the German of "KI für Nicht-Programmierer*innen: Angewandte KI im Hörsaal für Studierende ohne Programmierkenntnisse". Translated from the German of this https URLJournal-ref: Voneinander Lehren lernen (5) (2024)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Applications such as ChatGPT and WOMBO Dream make it easy to inspire students without programming knowledge to use artificial intelligence (AI). Therefore, given the increasing importance of AI in all disciplines, innovative strategies are needed to educate students in AI without programming knowledge so that AI can be integrated into their study modules as a future skill. This work presents a didactic planning script for applied AI. The didactic planning script is based on the AI application pipeline and links AI concepts with study-relevant topics. These linkages open up a new solution space and promote students' interest in and understanding of the potentials and risks of AI. An example lecture series for master students in energy management shows how AI can be seamlessly integrated into discipline-specific lectures. To this end, the planning script for applied AI is adapted to fit the study programs' topic. This specific teaching scenario enables students to solve a discipline-specific task step by step using the AI application pipeline. Thus, the application of the didactic planning script for applied AI shows the practical implementation of the theoretical concepts of AI. In addition, a checklist is presented that can be used to assess whether AI can be used in the discipline-specific lecture. AI as a future skill must be learned by students based on use cases that are relevant to the course of studies. For this reason, AI education should fit seamlessly into various curricula, even if the students do not have a programming background due to their field of study.
- [946] arXiv:2403.05548 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Monitoring the evolution of antisemitic discourse on extremist social media using BERTComments: 11 pages; 4 figures; 4 pagesSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: Racism and intolerance on social media contribute to a toxic online environment which may spill offline to foster hatred, and eventually lead to physical violence. That is the case with online antisemitism, the specific category of hatred considered in this study. Tracking antisemitic themes and their associated terminology over time in online discussions could help monitor the sentiments of their participants and their evolution, and possibly offer avenues for intervention that may prevent the escalation of hatred. Due to the large volume and constant evolution of online traffic, monitoring conversations manually is impractical. Instead, we propose an automated method that extracts antisemitic themes and terminology from extremist social media over time and captures their evolution. Since supervised learning would be too limited for such a task, we created an unsupervised online machine learning approach that uses large language models to assess the contextual similarity of posts. The method clusters similar posts together, dividing, and creating additional clusters over time when sub-themes emerge from existing ones or new themes appear. The antisemitic terminology used within each theme is extracted from the posts in each cluster. Our experiments show that our methodology outperforms existing baselines and demonstrates the kind of themes and sub-themes it discovers within antisemitic discourse along with their associated terminology. We believe that our approach will be useful for monitoring the evolution of all kinds of hatred beyond antisemitism on social platforms.
- [947] arXiv:2403.05550 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Teranga Go!: Carpooling Collaborative Consumption Community with multi-criteria hesitant fuzzy linguistic term set opinions to build confidence and trustComments: project at this https URL . arXiv admin note: substantial text overlap with arXiv:2402.01775Journal-ref: Applied Soft Computing 67, 2018, Pages 941-952Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Classic Delphi and Fuzzy Delphi methods are used to test content validity of a data collection tools such as questionnaires. Fuzzy Delphi takes the opinion issued by judges from a linguistic perspective reducing ambiguity in opinions by using fuzzy numbers. We propose an extension named 2-Tuple Fuzzy Linguistic Delphi method to deal with scenarios in which judges show different expertise degrees by using fuzzy multigranular semantics of the linguistic terms and to obtain intermediate and final results expressed by 2-tuple linguistic values. The key idea of our proposal is to validate the full questionnaire by means of the evaluation of its parts, defining the validity of each item as a Decision Making problem. Taking the opinion of experts, we measure the degree of consensus, the degree of consistency, and the linguistic score of each item, in order to detect those items that affect, positively or negatively, the quality of the instrument. Considering the real need to evaluate a b-learning educational experience with a consensual questionnaire, we present a Decision Making model for questionnaire validation that solve it. Additionally, we contribute to this consensus reaching problem by developing an online tool under GPL v3 license. The software visualizes the collective valuations for each iteration and assists to determine which parts of the questionnaire should be modified to reach a consensual solution.
- [948] arXiv:2403.05552 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Multi-source and multimodal data fusion for predicting academic performance in blended learning university coursesJournal-ref: Computers & Electrical Engineering, 89, 106908 (2021)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper we applied data fusion approaches for predicting the final academic performance of university students using multiple-source, multimodal data from blended learning environments. We collected and preprocessed data about first-year university students from different sources: theory classes, practical sessions, on-line Moodle sessions, and a final exam. Our objective was to discover which data fusion approach produced the best results using our data. We carried out experiments by applying four different data fusion approaches and six classification algorithms. The results showed that the best predictions were produced using ensembles and selecting the best attributes approach with discretized data. The best prediction models showed us that the level of attention in theory classes, scores in Moodle quizzes, and the level of activity in Moodle forums were the best set of attributes for predicting students' final performance in our courses.
- [949] arXiv:2403.05562 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: SDXL Finetuned with LoRA for Coloring Therapy: Generating Graphic Templates Inspired by United Arab Emirates CultureSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: A transformative approach to mental health therapy lies at the crossroads of cultural heritage and advanced technology. This paper introduces an innovative method that fuses machine learning techniques with traditional Emirati motifs, focusing on the United Arab Emirates (UAE). We utilize the Stable Diffusion XL (SDXL) model, enhanced with Low-Rank Adaptation (LoRA), to create culturally significant coloring templates featuring Al-Sadu weaving patterns. This novel approach leverages coloring therapy for its recognized stress-relieving benefits and embeds deep cultural resonance, making it a potent tool for therapeutic intervention and cultural preservation. Specifically targeting Generalized Anxiety Disorder (GAD), our method demonstrates significant potential in reducing associated symptoms. Additionally, the paper delves into the broader implications of color and music therapy, emphasizing the importance of culturally tailored content. The technical aspects of the SDXL model and its LoRA fine-tuning showcase its capability to generate high-quality, culturally specific images. This research stands at the forefront of integrating mental wellness practices with cultural heritage, providing a groundbreaking perspective on the synergy between technology, culture, and healthcare. In future work, we aim to employ biosignals to assess the level of engagement and effectiveness of color therapy. A key focus will be to examine the impact of the Emirati heritage Al Sadu art on Emirati individuals and compare their responses with those of other nationalities. This will provide deeper insights into the cultural specificity of therapeutic interventions and further the understanding of the unique interplay between cultural identity and mental health therapy.
- [950] arXiv:2403.05565 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine LearningJiaqi Ma , Vivian Lai , Yiming Zhang , Chacha Chen , Paul Hamilton , Davor Ljubenkov , Himabindu Lakkaraju , Chenhao TanSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Recently, there has been a surge of explainable AI (XAI) methods driven by the need for understanding machine learning model behaviors in high-stakes scenarios. However, properly evaluating the effectiveness of the XAI methods inevitably requires the involvement of human subjects, and conducting human-centered benchmarks is challenging in a number of ways: designing and implementing user studies is complex; numerous design choices in the design space of user study lead to problems of reproducibility; and running user studies can be challenging and even daunting for machine learning researchers. To address these challenges, this paper presents OpenHEXAI, an open-source framework for human-centered evaluation of XAI methods. OpenHEXAI features (1) a collection of diverse benchmark datasets, pre-trained models, and post hoc explanation methods; (2) an easy-to-use web application for user study; (3) comprehensive evaluation metrics for the effectiveness of post hoc explanation methods in the context of human-AI decision making tasks; (4) best practice recommendations of experiment documentation; and (5) convenient tools for power analysis and cost estimation. OpenHEAXI is the first large-scale infrastructural effort to facilitate human-centered benchmarks of XAI methods. It simplifies the design and implementation of user studies for XAI methods, thus allowing researchers and practitioners to focus on the scientific questions. Additionally, it enhances reproducibility through standardized designs. Based on OpenHEXAI, we further conduct a systematic benchmark of four state-of-the-art post hoc explanation methods and compare their impacts on human-AI decision making tasks in terms of accuracy, fairness, as well as users' trust and understanding of the machine learning model.
- [951] arXiv:2403.05572 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Is ChatGPT More Empathetic than Humans?Comments: 21 pages, 16 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: This paper investigates the empathetic responding capabilities of ChatGPT, particularly its latest iteration, GPT-4, in comparison to human-generated responses to a wide range of emotional scenarios, both positive and negative. We employ a rigorous evaluation methodology, involving a between-groups study with 600 participants, to evaluate the level of empathy in responses generated by humans and ChatGPT. ChatGPT is prompted in two distinct ways: a standard approach and one explicitly detailing empathy's cognitive, affective, and compassionate counterparts. Our findings indicate that the average empathy rating of responses generated by ChatGPT exceeds those crafted by humans by approximately 10%. Additionally, instructing ChatGPT to incorporate a clear understanding of empathy in its responses makes the responses align approximately 5 times more closely with the expectations of individuals possessing a high degree of empathy, compared to human responses. The proposed evaluation framework serves as a scalable and adaptable framework to assess the empathetic capabilities of newer and updated versions of large language models, eliminating the need to replicate the current study's results in future research.
- [952] arXiv:2403.05574 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: HealMe: Harnessing Cognitive Reframing in Large Language Models for PsychotherapyMengxi Xiao , Qianqian Xie , Ziyan Kuang , Zhicheng Liu , Kailai Yang , Min Peng , Weiguang Han , Jimin HuangComments: 17 pages, 4 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large Language Models (LLMs) can play a vital role in psychotherapy by adeptly handling the crucial task of cognitive reframing and overcoming challenges such as shame, distrust, therapist skill variability, and resource scarcity. Previous LLMs in cognitive reframing mainly converted negative emotions to positive ones, but these approaches have limited efficacy, often not promoting clients' self-discovery of alternative perspectives. In this paper, we unveil the Helping and Empowering through Adaptive Language in Mental Enhancement (HealMe) model. This novel cognitive reframing therapy method effectively addresses deep-rooted negative thoughts and fosters rational, balanced perspectives. Diverging from traditional LLM methods, HealMe employs empathetic dialogue based on psychotherapeutic frameworks. It systematically guides clients through distinguishing circumstances from feelings, brainstorming alternative viewpoints, and developing empathetic, actionable suggestions. Moreover, we adopt the first comprehensive and expertly crafted psychological evaluation metrics, specifically designed to rigorously assess the performance of cognitive reframing, in both AI-simulated dialogues and real-world therapeutic conversations. Experimental results show that our model outperforms others in terms of empathy, guidance, and logical coherence, demonstrating its effectiveness and potential positive impact on psychotherapy.
- [953] arXiv:2403.05576 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Understanding Subjectivity through the Lens of Motivational Context in Model-Generated Image SatisfactionSenjuti Dutta , Sherol Chen , Sunny Mak , Amnah Ahmad , Katherine Collins , Alena Butryna , Deepak Ramachandran , Krishnamurthy Dvijotham , Ellie Pavlick , Ravi RajakumarSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Image generation models are poised to become ubiquitous in a range of applications. These models are often fine-tuned and evaluated using human quality judgments that assume a universal standard, failing to consider the subjectivity of such tasks. To investigate how to quantify subjectivity, and the scale of its impact, we measure how assessments differ among human annotators across different use cases. Simulating the effects of ordinarily latent elements of annotators subjectivity, we contrive a set of motivations (t-shirt graphics, presentation visuals, and phone background images) to contextualize a set of crowdsourcing tasks. Our results show that human evaluations of images vary within individual contexts and across combinations of contexts. Three key factors affecting this subjectivity are image appearance, image alignment with text, and representation of objects mentioned in the text. Our study highlights the importance of taking individual users and contexts into account, both when building and evaluating generative models
- [954] arXiv:2403.05578 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Chaining text-to-image and large language model: A novel approach for generating personalized e-commerce bannersShanu Vashishtha , Abhinav Prakash , Lalitesh Morishetti , Kaushiki Nag , Yokila Arora , Sushant Kumar , Kannan AchanComments: 10 pagesSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Text-to-image models such as stable diffusion have opened a plethora of opportunities for generating art. Recent literature has surveyed the use of text-to-image models for enhancing the work of many creative artists. Many e-commerce platforms employ a manual process to generate the banners, which is time-consuming and has limitations of scalability. In this work, we demonstrate the use of text-to-image models for generating personalized web banners with dynamic content for online shoppers based on their interactions. The novelty in this approach lies in converting users' interaction data to meaningful prompts without human intervention. To this end, we utilize a large language model (LLM) to systematically extract a tuple of attributes from item meta-information. The attributes are then passed to a text-to-image model via prompt engineering to generate images for the banner. Our results show that the proposed approach can create high-quality personalized banners for users.
- [955] arXiv:2403.05579 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: Cultural Bias in Explainable AI Research: A Systematic AnalysisSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: For synergistic interactions between humans and artificial intelligence (AI) systems, AI outputs often need to be explainable to people. Explainable AI (XAI) systems are commonly tested in human user studies. However, whether XAI researchers consider potential cultural differences in human explanatory needs remains unexplored. We highlight psychological research that found significant differences in human explanations between many people from Western, commonly individualist countries and people from non-Western, often collectivist countries. We argue that XAI research currently overlooks these variations and that many popular XAI designs implicitly and problematically assume that Western explanatory needs are shared cross-culturally. Additionally, we systematically reviewed over 200 XAI user studies and found that most studies did not consider relevant cultural variations, sampled only Western populations, but drew conclusions about human-XAI interactions more generally. We also analyzed over 30 literature reviews of XAI studies. Most reviews did not mention cultural differences in explanatory needs or flag overly broad cross-cultural extrapolations of XAI user study results. Combined, our analyses provide evidence of a cultural bias toward Western populations in XAI research, highlighting an important knowledge gap regarding how culturally diverse users may respond to widely used XAI systems that future work can and should address.
- [956] arXiv:2403.05581 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Can Interpretability Layouts Influence Human Perception of Offensive Sentences?Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper conducts a user study to assess whether three machine learning (ML) interpretability layouts can influence participants' views when evaluating sentences containing hate speech, focusing on the "Misogyny" and "Racism" classes. Given the existence of divergent conclusions in the literature, we provide empirical evidence on using ML interpretability in online communities through statistical and qualitative analyses of questionnaire responses. The Generalized Additive Model estimates participants' ratings, incorporating within-subject and between-subject designs. While our statistical analysis indicates that none of the interpretability layouts significantly influences participants' views, our qualitative analysis demonstrates the advantages of ML interpretability: 1) triggering participants to provide corrective feedback in case of discrepancies between their views and the model, and 2) providing insights to evaluate a model's behavior beyond traditional performance metrics.
- [957] arXiv:2403.05583 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: A Cross-Modal Approach to Silent Speech with LLM-Enhanced RecognitionSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our introduction of Large Language Model (LLM) Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset for silent speech on an open vocabulary. For vocal EMG recordings, our method improves the state-of-the-art from 23.3% to 3.7% WER. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities in human-computer interaction, demonstrating the potential of cross-modal approaches in noisy and data-limited regimes.
- [958] arXiv:2403.05584 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Time2Stop: Adaptive and Explainable Human-AI Loop for Smartphone Overuse InterventionAdiba Orzikulova , Han Xiao , Zhipeng Li , Yukang Yan , Yuntao Wang , Yuanchun Shi , Marzyeh Ghassemi , Sung-Ju Lee , Anind K Dey , Xuhai "Orson" XuSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Despite a rich history of investigating smartphone overuse intervention techniques, AI-based just-in-time adaptive intervention (JITAI) methods for overuse reduction are lacking. We develop Time2Stop, an intelligent, adaptive, and explainable JITAI system that leverages machine learning to identify optimal intervention timings, introduces interventions with transparent AI explanations, and collects user feedback to establish a human-AI loop and adapt the intervention model over time. We conducted an 8-week field experiment (N=71) to evaluate the effectiveness of both the adaptation and explanation aspects of Time2Stop. Our results indicate that our adaptive models significantly outperform the baseline methods on intervention accuracy (>32.8\% relatively) and receptivity (>8.0\%). In addition, incorporating explanations further enhances the effectiveness by 53.8\% and 11.4\% on accuracy and receptivity, respectively. Moreover, Time2Stop significantly reduces overuse, decreasing app visit frequency by 7.0$\sim$8.9\%. Our subjective data also echoed these quantitative measures. Participants preferred the adaptive interventions and rated the system highly on intervention time accuracy, effectiveness, and level of trust. We envision our work can inspire future research on JITAI systems with a human-AI loop to evolve with users.
- [959] arXiv:2403.05585 (cross-list from physics.soc-ph) [ pdf , ps , other ]
-
Title: Plasmon Resonance Model: Investigation of Analysis of Fake News Diffusion Model with Third Mover Intervention Using Soliton Solution in Non-Complete Information Game under Repeated Dilemma ConditionComments: Plasmon Resonance Model, Soliton Solution, Third Mover,Fake News, Non-Complete Information Game, Nonlinear Partial Differential Equations, First Mover, Second Mover, Third Mover, Diffusion Dynamics, Iteration DilemmaSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI)
Abstract: In this research note, we propose a new approach to model the fake news diffusion process within the framework of incomplete information games. In particular, we use nonlinear partial differential equations to represent the phenomenon of plasmon resonance, in which the diffusion of fake news is rapidly amplified within a particular social group or communication network, and analyze its dynamics through a soliton solution approach. In addition, we consider how first mover, second mover, and third mover strategies interact within this nonlinear system and contribute to the amplification or suppression of fake news diffusion. The model aims to understand the mechanisms of fake news proliferation and provide insights into how to prevent or combat it. By combining concepts from the social sciences and the physical sciences, this study attempts to develop a new theoretical framework for the contemporary problem of fake news.This paper is partially an attempt to utilize "Generative AI" and was written with educational intent. There are currently no plans for it to become a peer-reviewed paper.
- [960] arXiv:2403.05589 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Ergonomic Design of Computer Laboratory Furniture: Mismatch Analysis Utilizing Anthropometric Data of University StudentsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Many studies have shown how ergonomically designed furniture improves productivity and well-being. As computers have become a part of students' academic lives, they will grow further in the future. We propose anthropometric-based furniture dimensions suitable for university students to improve computer laboratory ergonomics. We collected data from 380 participants and analyzed 11 anthropometric measurements, correlating them to 11 furniture dimensions. Two types of furniture were studied: a non-adjustable chair with a non-adjustable table and an adjustable chair with a non-adjustable table. The mismatch calculation showed a significant difference between furniture dimensions and anthropometric measurements. The one-way ANOVA test with a significance level of 5% also showed a significant difference between proposed and existing furniture dimensions. The proposed dimensions were found to be more compatible and reduced mismatch percentages for both males and females compared to existing furniture. The proposed dimensions of the furniture set with adjustable seat height showed slightly improved results compared to the non-adjustable furniture set. This suggests that the proposed dimensions can improve comfort levels and reduce the risk of musculoskeletal disorders among students. Further studies on the implementation and long-term effects of these proposed dimensions in real-world computer laboratory settings are recommended.
- [961] arXiv:2403.05592 (cross-list from cs.GL) [ pdf , ps , html , other ]
-
Title: Eternal Sunshine of the Mechanical Mind: The Irreconcilability of Machine Learning and the Right to be ForgottenSubjects: General Literature (cs.GL) ; Artificial Intelligence (cs.AI)
Abstract: As we keep rapidly advancing toward an era where artificial intelligence is a constant and normative experience for most of us, we must also be aware of what this vision and this progress entail. By first approximating neural connections and activities in computer circuits and then creating more and more sophisticated versions of this crude approximation, we are now facing an age to come where modern deep learning-based artificial intelligence systems can rightly be called thinking machines, and they are sometimes even lauded for their emergent behavior and black-box approaches. But as we create more powerful electronic brains, with billions of neural connections and parameters, can we guarantee that these mammoths built of artificial neurons will be able to forget the data that we store in them? If they are at some level like a brain, can the right to be forgotten still be protected while dealing with these AIs? The essential gap between machine learning and the RTBF is explored in this article, with a premonition of far-reaching conclusions if the gap is not bridged or reconciled any time soon. The core argument is that deep learning models, due to their structure and size, cannot be expected to forget or delete a data as it would be expected from a tabular database, and they should be treated more like a mechanical brain, albeit still in development.
- [962] arXiv:2403.05593 (cross-list from physics.soc-ph) [ pdf , ps , other ]
-
Title: Introducing First-Principles Calculations: New Approach to Group Dynamics and Bridging Social Phenomena in TeNP-Chain Based Social Dynamics SimulationsComments: TeNP Chains, First-principles calculations, Tellurium nanoparticles (TeNPs), Graphene, Fake news dissemination, Social cohesion, Information Flow Disruption, Quantum Mechanics, Interdisciplinary approach, Misinformation mitigationSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI); Physics Education (physics.ed-ph)
Abstract: This note considers an innovative interdisciplinary methodology that bridges the gap between the fundamental principles of quantum mechanics applied to the study of materials such as tellurium nanoparticles (TeNPs) and graphene and the complex dynamics of social systems. The basis for this approach lies in the metaphorical parallels drawn between the structural features of TeNPs and graphene and the behavioral patterns of social groups in the face of misinformation. TeNPs exhibit unique properties such as the strengthening of covalent bonds within telluric chains and the disruption of secondary structure leading to the separation of these chains. This is analogous to increased cohesion within social groups and disruption of information flow between different subgroups, respectively. . Similarly, the outstanding properties of graphene, such as high electrical conductivity, strength, and flexibility, provide additional aspects for understanding the resilience and adaptability of social structures in response to external stimuli such as fake news. This research note proposes a novel metaphorical framework for analyzing the spread of fake news within social groups, analogous to the structural features of telluric nanoparticles (TeNPs). We investigate how the strengthening of covalent bonds within TeNPs reflects the strengthening of social cohesion in groups that share common beliefs and values. This paper is partially an attempt to utilize "Generative AI" and was written with educational intent. There are currently no plans for it to become a peer-reviewed paper.
- [963] arXiv:2403.05606 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Concept-based Interpretable Model for the Diagnosis of Choroid Neoplasias using Multimodal DataYifan Wu , Yang Liu , Yue Yang , Michael S. Yao , Wenli Yang , Xuehui Shi , Lihong Yang , Dongjun Li , Yueming Liu , James C. Gee , Xuan Yang , Wenbin Wei , Shi GuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Diagnosing rare diseases presents a common challenge in clinical practice, necessitating the expertise of specialists for accurate identification. The advent of machine learning offers a promising solution, while the development of such technologies is hindered by the scarcity of data on rare conditions and the demand for models that are both interpretable and trustworthy in a clinical context. Interpretable AI, with its capacity for human-readable outputs, can facilitate validation by clinicians and contribute to medical education. In the current work, we focus on choroid neoplasias, the most prevalent form of eye cancer in adults, albeit rare with 5.1 per million. We built the so-far largest dataset consisting of 750 patients, incorporating three distinct imaging modalities collected from 2004 to 2022. Our work introduces a concept-based interpretable model that distinguishes between three types of choroidal tumors, integrating insights from domain experts via radiological reports. Remarkably, this model not only achieves an F1 score of 0.91, rivaling that of black-box models, but also boosts the diagnostic accuracy of junior doctors by 42%. This study highlights the significant potential of interpretable machine learning in improving the diagnosis of rare diseases, laying a groundwork for future breakthroughs in medical AI that could tackle a wider array of complex health scenarios.
- [964] arXiv:2403.05612 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unfamiliar Finetuning Examples Control How Language Models HallucinateSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) have a tendency to generate plausible-sounding yet factually incorrect responses, especially when queried on unfamiliar concepts. In this work, we explore the underlying mechanisms that govern how finetuned LLMs hallucinate. Our investigation reveals an interesting pattern: as inputs become more unfamiliar, LLM outputs tend to default towards a ``hedged'' prediction, whose form is determined by how the unfamiliar examples in the finetuning data are supervised. Thus, by strategically modifying these examples' supervision, we can control LLM predictions for unfamiliar inputs (e.g., teach them to say ``I don't know''). Based on these principles, we develop an RL approach that more reliably mitigates hallucinations for long-form generation tasks, by tackling the challenges presented by reward model hallucinations. We validate our findings with a series of controlled experiments in multiple-choice QA on MMLU, as well as long-form biography and book/movie plot generation tasks.
- [965] arXiv:2403.05645 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Geometric Neural Network based on Phase Space for BCI decodingIgor Carrara , Bruno Aristimunha , Marie-Constance Corsi , Raphael Y. de Camargo , Sylvain Chevallier , Théodore PapadopouloSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract: The integration of Deep Learning (DL) algorithms on brain signal analysis is still in its nascent stages compared to their success in fields like Computer Vision, especially in Brain-Computer Interface (BCI), where the brain activity is decoded to control external devices without requiring muscle control. Electroencephalography (EEG) is a widely adopted choice for designing BCI systems due to its non-invasive and cost-effective nature and excellent temporal resolution. Still, it comes at the expense of limited training data, poor signal-to-noise, and a large variability across and within-subject recordings. Finally, setting up a BCI system with many electrodes takes a long time, hindering the widespread adoption of reliable DL architectures in BCIs outside research laboratories. To improve adoption, we need to improve user comfort using, for instance, reliable algorithms that operate with few electrodes. \textbf{Approach:} Our research aims to develop a DL algorithm that delivers effective results with a limited number of electrodes. Taking advantage of the Augmented Covariance Method with SPDNet, we propose the SPDNet$_{\psi}$ architecture and analyze its performance and computational impact, as well as the interpretability of the results. The evaluation is conducted on 5-fold cross-validation, using only three electrodes positioned above the Motor Cortex. The methodology was tested on nearly 100 subjects from several open-source datasets using the Mother Of All BCI Benchmark (MOABB) framework. \textbf{Main results:} The results of our SPDNet$_{\psi}$ demonstrate that the augmented approach combined with the SPDNet significantly outperforms all the current state-of-the-art DL architecture in MI decoding. \textbf{Significance:} This new architecture is explainable, with a low number of trainable parameters and a reduced carbon footprint.
- [966] arXiv:2403.05652 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: What is different between these datasets?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The performance of machine learning models heavily depends on the quality of input data, yet real-world applications often encounter various data-related challenges. One such challenge could arise when curating training data or deploying the model in the real world - two comparable datasets in the same domain may have different distributions. While numerous techniques exist for detecting distribution shifts, the literature lacks comprehensive approaches for explaining dataset differences in a human-understandable manner. To address this gap, we propose a suite of interpretable methods (toolbox) for comparing two datasets. We demonstrate the versatility of our approach across diverse data modalities, including tabular data, language, images, and signals in both low and high-dimensional settings. Our methods not only outperform comparable and related approaches in terms of explanation quality and correctness, but also provide actionable, complementary insights to understand and mitigate dataset differences effectively.
- [967] arXiv:2403.05658 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Feature CAM: Interpretable AI in Image ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Deep Neural Networks have often been called the black box because of the complex, deep architecture and non-transparency presented by the inner layers. There is a lack of trust to use Artificial Intelligence in critical and high-precision fields such as security, finance, health, and manufacturing industries. A lot of focused work has been done to provide interpretable models, intending to deliver meaningful insights into the thoughts and behavior of neural networks. In our research, we compare the state-of-the-art methods in the Activation-based methods (ABM) for interpreting predictions of CNN models, specifically in the application of Image Classification. We then extend the same for eight CNN-based architectures to compare the differences in visualization and thus interpretability. We introduced a novel technique Feature CAM, which falls in the perturbation-activation combination, to create fine-grained, class-discriminative visualizations. The resulting saliency maps from our experiments proved to be 3-4 times better human interpretable than the state-of-the-art in ABM. At the same time it reserves machine interpretability, which is the average confidence scores in classification.
- [968] arXiv:2403.05681 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: DP-TabICL: In-Context Learning with Differentially Private Tabular DataComments: 15 pages, 2 figures, 9 tablesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs and it has been shown to have comparable performance to costly model retraining and fine-tuning. Recently, ICL has been extended to allow tabular data to be used as demonstration examples by serializing individual records into natural language formats. However, it has been shown that LLMs can leak information contained in prompts, and since tabular data often contain sensitive information, understanding how to protect the underlying tabular data used in ICL is a critical area of research. This work serves as an initial investigation into how to use differential privacy (DP) -- the long-established gold standard for data privacy and anonymization -- to protect tabular data used in ICL. Specifically, we investigate the application of DP mechanisms for private tabular ICL via data privatization prior to serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP scenarios via injecting noise into individual records or group statistics, respectively. We evaluate our DP-based frameworks on eight real-world tabular datasets and across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving comparable performance to non-LLM baselines, especially under high privacy regimes.
- [969] arXiv:2403.05701 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Are Large Language Models Aligned with People's Social Intuitions for Human-Robot Interactions?Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Large language models (LLMs) are increasingly used in robotics, especially for high-level action planning. Meanwhile, many robotics applications involve human supervisors or collaborators. Hence, it is crucial for LLMs to generate socially acceptable actions that align with people's preferences and values. In this work, we test whether LLMs capture people's intuitions about behavior judgments and communication preferences in human-robot interaction (HRI) scenarios. For evaluation, we reproduce three HRI user studies, comparing the output of LLMs with that of real participants. We find that GPT-4 strongly outperforms other models, generating answers that correlate strongly with users' answers in two studies $\unicode{x2014}$ the first study dealing with selecting the most appropriate communicative act for a robot in various situations ($r_s$ = 0.82), and the second with judging the desirability, intentionality, and surprisingness of behavior ($r_s$ = 0.83). However, for the last study, testing whether people judge the behavior of robots and humans differently, no model achieves strong correlations. Moreover, we show that vision models fail to capture the essence of video stimuli and that LLMs tend to rate different communicative acts and behavior desirability higher than people.
- [970] arXiv:2403.05715 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: A Framework for Effective AI Recommendations in Cyber-Physical-Human SystemsSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Many cyber-physical-human systems (CPHS) involve a human decision-maker who may receive recommendations from an artificial intelligence (AI) platform while holding the ultimate responsibility of making decisions. In such CPHS applications, the human decision-maker may depart from an optimal recommended decision and instead implement a different one for various reasons. In this letter, we develop a rigorous framework to overcome this challenge. In our framework, we consider that humans may deviate from AI recommendations as they perceive and interpret the system's state in a different way than the AI platform. We establish the structural properties of optimal recommendation strategies and develop an approximate human model (AHM) used by the AI. We provide theoretical bounds on the optimality gap that arises from an AHM and illustrate the efficacy of our results in a numerical example.
- [971] arXiv:2403.05720 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Benchmark of Domain-Adapted Large Language Models for Generating Brief Hospital Course SummariesAsad Aali , Dave Van Veen , Yamin Ishraq Arefeen , Jason Hom , Christian Bluethgen , Eduardo Pontes Reis , Sergios Gatidis , Namuun Clifford , Joseph Daws , Arash S. Tehrani , Jangwon Kim , Akshay S. ChaudhariSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Brief hospital course (BHC) summaries are common clinical documents generated by summarizing clinical notes. While large language models (LLMs) depict remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as BHC synthesis have not been shown. To enable the adaptation of LLMs for BHC synthesis, we introduce a novel benchmark consisting of a pre-processed dataset extracted from MIMIC-IV notes, encapsulating clinical note, and brief hospital course (BHC) pairs. We assess the performance of two general-purpose LLMs and three healthcare-adapted LLMs to improve BHC synthesis from clinical notes. Using clinical notes as input for generating BHCs, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs (Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5, GPT-4). We quantitatively evaluate the performance of these LLMs across varying context-length inputs using conventional natural language similarity metrics. We further perform a qualitative study where five diverse clinicians blindly compare clinician-written BHCs and two LLM-generated BHCs for 30 samples across metrics of comprehensiveness, conciseness, factual correctness, and fluency. Overall, we present a new benchmark and pre-processed dataset for using LLMs in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. We propose our work as a benchmark to motivate future works to adapt and assess the performance of LLMs in BHC synthesis.
- [972] arXiv:2403.05750 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated TextSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges, explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.
- [973] arXiv:2403.05751 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning ProcessComments: International Conference on Learning Representations (ICLR) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.
- [974] arXiv:2403.05752 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Task-Oriented GNNs Training on Large Knowledge Graphs for Accurate and Efficient ModelingComments: 12 pages,9 Figures, 3 Tables, ICDE:2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A Knowledge Graph (KG) is a heterogeneous graph encompassing a diverse range of node and edge types. Heterogeneous Graph Neural Networks (HGNNs) are popular for training machine learning tasks like node classification and link prediction on KGs. However, HGNN methods exhibit excessive complexity influenced by the KG's size, density, and the number of node and edge types. AI practitioners handcraft a subgraph of a KG G relevant to a specific task. We refer to this subgraph as a task-oriented subgraph (TOSG), which contains a subset of task-related node and edge types in G. Training the task using TOSG instead of G alleviates the excessive computation required for a large KG. Crafting the TOSG demands a deep understanding of the KG's structure and the task's objectives. Hence, it is challenging and time-consuming. This paper proposes KG-TOSA, an approach to automate the TOSG extraction for task-oriented HGNN training on a large KG. In KG-TOSA, we define a generic graph pattern that captures the KG's local and global structure relevant to a specific task. We explore different techniques to extract subgraphs matching our graph pattern: namely (i) two techniques sampling around targeted nodes using biased random walk or influence scores, and (ii) a SPARQL-based extraction method leveraging RDF engines' built-in indices. Hence, it achieves negligible preprocessing overhead compared to the sampling techniques. We develop a benchmark of real KGs of large sizes and various tasks for node classification and link prediction. Our experiments show that KG-TOSA helps state-of-the-art HGNN methods reduce training time and memory usage by up to 70% while improving the model performance, e.g., accuracy and inference time.
- [975] arXiv:2403.05759 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Membership Testing in Markov Equivalence Classes via Independence Query OraclesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Abstract: Understanding causal relationships between variables is a fundamental problem with broad impact in numerous scientific fields. While extensive research has been dedicated to learning causal graphs from data, its complementary concept of testing causal relationships has remained largely unexplored. While learning involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, the testing counterpart addresses the following critical question: Given a specific MEC and observational data from some causal graph, can we determine if the data-generating causal graph belongs to the given MEC?
We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\exp(\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\exp(O(s))$ tests, matching our lower bound. Compared to the learning problem, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, this shows that testing is relatively easier. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes. Additionally, using the DAG associahedron, we provide a geometric interpretation of testing versus learning and discuss how our testing result can aid learning. - [976] arXiv:2403.05763 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph ReasoningHanning Chen , Yang Ni , Ali Zakeri , Zhuowen Zou , Sanggeon Yun , Fei Wen , Behnam Khaleghi , Narayan Srinivasa , Hugo Latapie , Mohsen ImaniSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In recent times, a plethora of hardware accelerators have been put forth for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task that is well-known for its significantly higher algorithm complexity. The state-of-the-art KGC solutions based on graph convolution neural network (GCN) involve extensive vertex/relation embedding updates and complicated score functions, which are inherently cumbersome for acceleration. As a result, existing accelerator designs are no longer optimal, and a novel algorithm-hardware co-design for KG reasoning is needed.
Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight machine learning, particularly for graph learning applications. In this paper, we leverage HDC for an intrinsically more efficient and acceleration-friendly KGC algorithm. We also co-design an acceleration framework named HDReason targeting FPGA platforms. On the algorithm level, HDReason achieves a balance between high reasoning accuracy, strong model interpretability, and less computation complexity. In terms of architecture, HDReason offers reconfigurability, high training throughput, and low energy consumption. When compared with NVIDIA RTX 4090 GPU, the proposed accelerator achieves an average 10.6x speedup and 65x energy efficiency improvement. When conducting cross-models and cross-platforms comparison, HDReason yields an average 4.2x higher performance and 3.4x better energy efficiency with similar accuracy versus the state-of-the-art FPGA-based GCN training platform. - [977] arXiv:2403.05764 (cross-list from quant-ph) [ pdf , ps , html , other ]
-
Title: Investigation into the Potential of Parallel Quantum Annealing for Simultaneous Optimization of Multiple Problems: A Comprehensive StudySubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI)
Abstract: Parallel Quantum Annealing is a technique to solve multiple optimization problems simultaneously. Parallel quantum annealing aims to optimize the utilization of available qubits on a quantum topology by addressing multiple independent problems in a single annealing cycle. This study provides insights into the potential and the limitations of this parallelization method. The experiments consisting of two different problems are integrated, and various problem dimensions are explored including normalization techniques using specific methods such as DWaveSampler with Default Embedding, DWaveSampler with Custom Embedding and LeapHybridSampler. This method minimizes idle qubits and holds promise for substantial speed-up, as indicated by the Time-to-Solution (TTS) metric, compared to traditional quantum annealing, which solves problems sequentially and may leave qubits unutilized.
- [978] arXiv:2403.05767 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Extending Activation Steering to Broad Skills and Multiple BehavioursComments: Code is available at: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: Current large language models have dangerous capabilities, which are likely to become more problematic in the future. Activation steering techniques can be used to reduce risks from these capabilities. In this paper, we investigate the efficacy of activation steering for broad skills and multiple behaviours. First, by comparing the effects of reducing performance on general coding ability and Python-specific ability, we find that steering broader skills is competitive to steering narrower skills. Second, we steer models to become more or less myopic and wealth-seeking, among other behaviours. In our experiments, combining steering vectors for multiple different behaviours into one steering vector is largely unsuccessful. On the other hand, injecting individual steering vectors at different places in a model simultaneously is promising.
- [979] arXiv:2403.05770 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive LearningComments: Accepted by TPAMI 2023Journal-ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI,2023)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment. Despite significant advances, conventional VLN agents are trained typically under disturbance-free environments and may easily fail in real-world scenarios, since they are unaware of how to deal with various possible disturbances, such as sudden obstacles or human interruptions, which widely exist and may usually cause an unexpected route deviation. In this paper, we present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER) to enhance the generalization ability of existing VLN agents, by requiring them to learn towards deviation-robust navigation. Specifically, a simple yet effective path perturbation scheme is introduced to implement the route deviation, with which the agent is required to still navigate successfully following the original instruction. Since directly enforcing the agent to learn perturbed trajectories may lead to inefficient training, a progressively perturbed trajectory augmentation strategy is designed, where the agent can self-adaptively learn to navigate under perturbation with the improvement of its navigation performance for each specific trajectory. For encouraging the agent to well capture the difference brought by perturbation, a perturbation-aware contrastive learning mechanism is further developed by contrasting perturbation-free trajectory encodings and perturbation-based counterparts. Extensive experiments on R2R show that PROPER can benefit multiple VLN baselines in perturbation-free scenarios. We further collect the perturbed path data to construct an introspection subset based on the R2R, called Path-Perturbed R2R (PP-R2R). The results on PP-R2R show unsatisfying robustness of popular VLN agents and the capability of PROPER in improving the navigation robustness.
- [980] arXiv:2403.05788 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in SummarizationComments: EACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Text summarization and simplification are among the most widely used applications of AI. However, models developed for such tasks are often prone to hallucination, which can result from training on unaligned data. One efficient approach to address this issue is Loss Truncation (LT) (Kang and Hashimoto, 2020), an approach to modify the standard log loss to adaptively remove noisy examples during training. However, we find that LT alone yields a considerable number of hallucinated entities on various datasets. We study the behavior of the underlying losses between factual and non-factual examples, to understand and refine the performance of LT. We demonstrate that LT's performance is limited when the underlying assumption that noisy targets have higher NLL loss is not satisfied, and find that word-level NLL among entities provides better signal for distinguishing factuality. We then leverage this to propose a fine-grained NLL loss and fine-grained data cleaning strategies, and observe improvements in hallucination reduction across some datasets. Our work is available at https://https://github.com/yale-nlp/fine-grained-lt.
- [981] arXiv:2403.05789 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ItD: Large Language Models Can Teach Themselves Induction through DeductionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Although Large Language Models (LLMs) are showing impressive performance on a wide range of Natural Language Processing tasks, researchers have found that they still have limited ability to conduct induction. Recent works mainly adopt ``post processes'' paradigms to improve the performance of LLMs on induction (e.g., the hypothesis search & refinement methods), but their performance is still constrained by the inherent inductive capability of the LLMs. In this paper, we propose a novel framework, Induction through Deduction (ItD), to enable the LLMs to teach themselves induction through deduction. The ItD framework is composed of two main components: a Deductive Data Generation module to generate induction data and a Naive Bayesian Induction module to optimize the fine-tuning and decoding of LLMs. Our empirical results showcase the effectiveness of ItD on two induction benchmarks, achieving relative performance improvement of 36% and 10% compared with previous state-of-the-art, respectively. Our ablation study verifies the effectiveness of two key modules of ItD. We also verify the effectiveness of ItD across different LLMs and deductors. The data and code of this paper can be found at https://anonymous.4open.science/r/ItD-E844.
- [982] arXiv:2403.05794 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Privacy-Preserving Diffusion Model Using Homomorphic EncryptionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we introduce a privacy-preserving stable diffusion framework leveraging homomorphic encryption, called HE-Diffusion, which primarily focuses on protecting the denoising phase of the diffusion process. HE-Diffusion is a tailored encryption framework specifically designed to align with the unique architecture of stable diffusion, ensuring both privacy and functionality. To address the inherent computational challenges, we propose a novel min-distortion method that enables efficient partial image encryption, significantly reducing the overhead without compromising the model's output quality. Furthermore, we adopt a sparse tensor representation to expedite computational operations, enhancing the overall efficiency of the privacy-preserving diffusion process. We successfully implement HE-based privacy-preserving stable diffusion inference. The experimental results show that HE-Diffusion achieves 500 times speedup compared with the baseline method, and reduces time cost of the homomorphically encrypted inference to the minute level. Both the performance and accuracy of the HE-Diffusion are on par with the plaintext counterpart. Our approach marks a significant step towards integrating advanced cryptographic techniques with state-of-the-art generative models, paving the way for privacy-preserving and efficient image generation in critical applications.
- [983] arXiv:2403.05810 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Recurrent Aligned Network for Generalized Pedestrian Trajectory PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Pedestrian trajectory prediction is a crucial component in computer vision and robotics, but remains challenging due to the domain shift problem. Previous studies have tried to tackle this problem by leveraging a portion of the trajectory data from the target domain to adapt the model. However, such domain adaptation methods are impractical in real-world scenarios, as it is infeasible to collect trajectory data from all potential target domains. In this paper, we study a task named generalized pedestrian trajectory prediction, with the aim of generalizing the model to unseen domains without accessing their trajectories. To tackle this task, we introduce a Recurrent Aligned Network~(RAN) to minimize the domain gap through domain alignment. Specifically, we devise a recurrent alignment module to effectively align the trajectory feature spaces at both time-state and time-sequence levels by the recurrent alignment strategy.Furthermore, we introduce a pre-aligned representation module to combine social interactions with the recurrent alignment strategy, which aims to consider social interactions during the alignment process instead of just target trajectories. We extensively evaluate our method and compare it with state-of-the-art methods on three widely used benchmarks. The experimental results demonstrate the superior generalization capability of our method. Our work not only fills the gap in the generalization setting for practical pedestrian trajectory prediction but also sets strong baselines in this field.
- [984] arXiv:2403.05812 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Algorithmic progress in language modelsAnson Ho , Tamay Besiroglu , Ege Erdil , David Owen , Robi Rahman , Zifan Carl Guo , David Atkinson , Neil Thompson , Jaime SevillaSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore's Law. We estimate augmented scaling laws, which enable us to quantify algorithmic progress and determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, our analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period. Though limited by noisy benchmark data, our analysis quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms.
- [985] arXiv:2403.05814 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge GraphsComments: 20 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite advancements in on-topic dialogue systems, effectively managing topic shifts within dialogues remains a persistent challenge, largely attributed to the limited availability of training datasets. To address this issue, we propose Multi-Passage to Dialogue (MP2D), a data generation framework that automatically creates conversational question-answering datasets with natural topic transitions. By leveraging the relationships between entities in a knowledge graph, MP2D maps the flow of topics within a dialogue, effectively mirroring the dynamics of human conversation. It retrieves relevant passages corresponding to the topics and transforms them into dialogues through the passage-to-dialogue method. Through quantitative and qualitative experiments, we demonstrate MP2D's efficacy in generating dialogue with natural topic shifts. Furthermore, this study introduces a novel benchmark for topic shift dialogues, TS-WikiDialog. Utilizing the dataset, we demonstrate that even Large Language Models (LLMs) struggle to handle topic shifts in dialogue effectively, and we showcase the performance improvements of models trained on datasets generated by MP2D across diverse topic shift dialogue tasks.
- [986] arXiv:2403.05828 (cross-list from quant-ph) [ pdf , ps , html , other ]
-
Title: Multi-GPU-Enabled Hybrid Quantum-Classical Workflow in Quantum-HPC Middleware: Applications in Quantum SimulationsComments: 8 pages, 8 figuresSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Achieving high-performance computation on quantum systems presents a formidable challenge that necessitates bridging the capabilities between quantum hardware and classical computing resources. This study introduces an innovative distribution-aware Quantum-Classical-Quantum (QCQ) architecture, which integrates cutting-edge quantum software framework works with high-performance classical computing resources to address challenges in quantum simulation for materials and condensed matter physics. At the heart of this architecture is the seamless integration of VQE algorithms running on QPUs for efficient quantum state preparation, Tensor Network states, and QCNNs for classifying quantum states on classical hardware.
For benchmarking quantum simulators, the QCQ architecture utilizes the cuQuantum SDK to leverage multi-GPU acceleration, integrated with PennyLane's Lightning plugin, demonstrating up to tenfold increases in computational speed for complex phase transition classification tasks compared to traditional CPU-based methods. This significant acceleration enables models such as the transverse field Ising and XXZ systems to accurately predict phase transitions with a 99.5% accuracy. The architecture's ability to distribute computation between QPUs and classical resources addresses critical bottlenecks in Quantum-HPC, paving the way for scalable quantum simulation.
The QCQ framework embodies a synergistic combination of quantum algorithms, machine learning, and Quantum-HPC capabilities, enhancing its potential to provide transformative insights into the behavior of quantum systems across different scales. As quantum hardware continues to improve, this hybrid distribution-aware framework will play a crucial role in realizing the full potential of quantum computing by seamlessly integrating distributed quantum resources with the state-of-the-art classical computing infrastructure. - [987] arXiv:2403.05839 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Long-term Frame-Event Visual Tracking: Benchmark Dataset and BaselineComments: In Peer ReviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Current event-/frame-event based trackers undergo evaluation on short-term tracking datasets, however, the tracking of real-world scenarios involves long-term tracking, and the performance of existing tracking algorithms in these scenarios remains unclear. In this paper, we first propose a new long-term and large-scale frame-event single object tracking dataset, termed FELT. It contains 742 videos and 1,594,474 RGB frames and event stream pairs and has become the largest frame-event tracking dataset to date. We re-train and evaluate 15 baseline trackers on our dataset for future works to compare. More importantly, we find that the RGB frames and event streams are naturally incomplete due to the influence of challenging factors and spatially sparse event flow. In response to this, we propose a novel associative memory Transformer network as a unified backbone by introducing modern Hopfield layers into multi-head self-attention blocks to fuse both RGB and event data. Extensive experiments on RGB-Event (FELT), RGB-Thermal (RGBT234, LasHeR), and RGB-Depth (DepthTrack) datasets fully validated the effectiveness of our model. The dataset and source code can be found at \url{ this https URL }.
- [988] arXiv:2403.05842 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation EquivarianceSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: With the blossom of deep learning models and services, it has become an imperative concern to safeguard the valuable model parameters from being stolen. Watermarking is considered an important tool for ownership verification. However, current watermarking schemes are customized for different models and tasks, hard to be integrated as an integrated intellectual protection service. We propose Hufu, a modality-agnostic watermarking system for pre-trained Transformer-based models, relying on the permutation equivariance property of Transformers. Hufu embeds watermark by fine-tuning the pre-trained model on a set of data samples specifically permuted, and the embedded model essentially contains two sets of weights -- one for normal use and the other for watermark extraction which is triggered on permuted inputs. The permutation equivariance ensures minimal interference between these two sets of model weights and thus high fidelity on downstream tasks. Since our method only depends on the model itself, it is naturally modality-agnostic, task-independent, and trigger-sample-free. Extensive experiments on the state-of-the-art vision Transformers, BERT, and GPT2 have demonstrated Hufu's superiority in meeting watermarking requirements including effectiveness, efficiency, fidelity, and robustness, showing its great potential to be deployed as a uniform ownership verification service for various Transformers.
- [989] arXiv:2403.05845 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reverse That Number! Decoding Order Matters in Arithmetic LearningDaniel Zhang-Li , Nianyi Lin , Jifan Yu , Zheyuan Zhang , Zijun Yao , Xiaokang Zhang , Lei Hou , Jing Zhang , Juanzi LiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in pretraining have demonstrated that modern Large Language Models (LLMs) possess the capability to effectively learn arithmetic operations. However, despite acknowledging the significance of digit order in arithmetic computation, current methodologies predominantly rely on sequential, step-by-step approaches for teaching LLMs arithmetic, resulting in a conclusion where obtaining better performance involves fine-grained step-by-step. Diverging from this conventional path, our work introduces a novel strategy that not only reevaluates the digit order by prioritizing output from the least significant digit but also incorporates a step-by-step methodology to substantially reduce complexity. We have developed and applied this method in a comprehensive set of experiments. Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement of in accuracy while requiring only a third of the tokens typically used during training. For the purpose of facilitating replication and further research, we have made our code and dataset publicly available at \url{https://anonymous.4open.science/r/RAIT-9FB7/}.
- [990] arXiv:2403.05911 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Towards Optimizing Human-Centric Objectives in AI-Assisted Decision-Making With Offline Reinforcement LearningSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Imagine if AI decision-support tools not only complemented our ability to make accurate decisions, but also improved our skills, boosted collaboration, and elevated the joy we derive from our tasks. Despite the potential to optimize a broad spectrum of such human-centric objectives, the design of current AI tools remains focused on decision accuracy alone. We propose offline reinforcement learning (RL) as a general approach for modeling human-AI decision-making to optimize human-AI interaction for diverse objectives. RL can optimize such objectives by tailoring decision support, providing the right type of assistance to the right person at the right time. We instantiated our approach with two objectives: human-AI accuracy on the decision-making task and human learning about the task and learned decision support policies from previous human-AI interaction data. We compared the optimized policies against several baselines in AI-assisted decision-making. Across two experiments (N=316 and N=964), our results demonstrated that people interacting with policies optimized for accuracy achieve significantly better accuracy -- and even human-AI complementarity -- compared to those interacting with any other type of AI support. Our results further indicated that human learning was more difficult to optimize than accuracy, with participants who interacted with learning-optimized policies showing significant learning improvement only at times. Our research (1) demonstrates offline RL to be a promising approach to model human-AI decision-making, leading to policies that may optimize human-centric objectives and provide novel insights about the AI-assisted decision-making space, and (2) emphasizes the importance of considering human-centric objectives beyond decision accuracy in AI-assisted decision-making, opening up the novel research challenge of optimizing human-AI interaction for such objectives.
- [991] arXiv:2403.05916 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective ComputingHao Lu , Xuesong Niu , Jiyao Wang , Yin Wang , Qingyong Hu , Jiaqi Tang , Yuting Zhang , Kaishen Yuan , Bin Huang , Zitong Yu , Dengbo He , Shuiguang Deng , Hao Chen , Yingcong Chen , Shiguang ShanSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite its success in language understanding, it is critical to evaluate the performance of downstream tasks for better human-centric applications. This paper assesses the application of MLLMs with 5 crucial abilities for affective computing, spanning from visual affective tasks and reasoning tasks. The results show that \gpt has high accuracy in facial action unit recognition and micro-expression detection while its general facial expression recognition performance is not accurate. We also highlight the challenges of achieving fine-grained micro-expression recognition and the potential for further study and demonstrate the versatility and potential of \gpt for handling advanced tasks in emotion recognition and related fields by integrating with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Our interesting examples are at this https URL .
- [992] arXiv:2403.05918 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to Imbalanced DataComments: NoneSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In the field of data mining and machine learning, commonly used classification models cannot effectively learn in unbalanced data. In order to balance the data distribution before model training, oversampling methods are often used to generate data for a small number of classes to solve the problem of classifying unbalanced data. Most of the classical oversampling methods are based on the SMOTE technique, which only focuses on the local information of the data, and therefore the generated data may have the problem of not being realistic enough. In the current oversampling methods based on generative networks, the methods based on GANs can capture the true distribution of data, but there is the problem of pattern collapse and training instability in training; in the oversampling methods based on denoising diffusion probability models, the neural network of the inverse diffusion process using the U-Net is not applicable to tabular data, and although the MLP can be used to replace the U-Net, the problem exists due to the simplicity of the structure and the poor effect of removing noise. problem of poor noise removal. In order to overcome the above problems, we propose a novel oversampling method this http URL the SEMRes-DDPM backward diffusion process, a new neural network structure SEMST-ResNet is used, which is suitable for tabular data and has good noise removal effect, and it can generate tabular data with higher quality. Experiments show that the SEMResNet network removes noise better than MLP; SEMRes-DDPM generates data distributions that are closer to the real data distributions than TabDDPM with CWGAN-GP; on 20 real unbalanced tabular datasets with 9 classification models, SEMRes-DDPM improves the quality of the generated tabular data in terms of three evaluation metrics (F1, G-mean, AUC) with better classification performance than other SOTA oversampling methods.
- [993] arXiv:2403.05920 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: High Throughput Phenotyping of Physician Notes with Large Language and Hybrid NLP ModelsComments: Submitted to IEEE EMBS Summer conference 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high throughput methods. Over the past thirty years, progress toward making high throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high throughput deep phenotyping of physician notes.
- [994] arXiv:2403.05932 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learned 3D volumetric recovery of clouds and its uncertainty for climate analysisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Significant uncertainty in climate prediction and cloud physics is tied to observational gaps relating to shallow scattered clouds. Addressing these challenges requires remote sensing of their three-dimensional (3D) heterogeneous volumetric scattering content. This calls for passive scattering computed tomography (CT). We design a learning-based model (ProbCT) to achieve CT of such clouds, based on noisy multi-view spaceborne images. ProbCT infers - for the first time - the posterior probability distribution of the heterogeneous extinction coefficient, per 3D location. This yields arbitrary valuable statistics, e.g., the 3D field of the most probable extinction and its uncertainty. ProbCT uses a neural-field representation, making essentially real-time inference. ProbCT undergoes supervised training by a new labeled multi-class database of physics-based volumetric fields of clouds and their corresponding images. To improve out-of-distribution inference, we incorporate self-supervised learning through differential rendering. We demonstrate the approach in simulations and on real-world data, and indicate the relevance of 3D recovery and uncertainty to precipitation and renewable energy.
- [995] arXiv:2403.05950 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Classifying Objects in 3D Point Clouds Using Recurrent Neural Network: A GRU LSTM Hybrid ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Accurate classification of objects in 3D point clouds is a significant problem in several applications, such as autonomous navigation and augmented/virtual reality scenarios, which has become a research hot spot. In this paper, we presented a deep learning strategy for 3D object classification in augmented reality. The proposed approach is a combination of the GRU and LSTM. LSTM networks learn longer dependencies well, but due to the number of gates, it takes longer to train; on the other hand, GRU networks have a weaker performance than LSTM, but their training speed is much higher than GRU, which is The speed is due to its fewer gates. The proposed approach used the combination of speed and accuracy of these two networks. The proposed approach achieved an accuracy of 0.99 in the 4,499,0641 points dataset, which includes eight classes (unlabeled, man-made terrain, natural terrain, high vegetation, low vegetation, buildings, hardscape, scanning artifacts, cars). Meanwhile, the traditional machine learning approaches could achieve a maximum accuracy of 0.9489 in the best case. Keywords: Point Cloud Classification, Virtual Reality, Hybrid Model, GRULSTM, GRU, LSTM
- [996] arXiv:2403.05973 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Calibrating Large Language Models Using Their Generations OnlySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As large language models (LLMs) are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or adjusting the given answer based on the confidence. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.
- [997] arXiv:2403.05996 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Dissecting Deep RL with High Update Ratios: Combatting Value Overestimation and DivergenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We show that deep reinforcement learning can maintain its ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples. Under such large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we dissect the phenomena underlying the primacy bias. We inspect the early stages of training that ought to cause the failure to learn and find that a fundamental challenge is a long-standing acquaintance: value overestimation. Overinflated Q-values are found not only on out-of-distribution but also in-distribution data and can be traced to unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting on early data.
- [998] arXiv:2403.06003 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Generalized Acquisition Function for Preference-based Reward LearningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task. Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency. The information gain criterion focuses on precisely identifying all parameters of the reward function. This can potentially be wasteful as many parameters may result in the same reward, and many rewards may result in the same behavior in the downstream tasks. Instead, we show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar. We introduce a tractable framework that can capture such definitions of similarity. Our experiments in a synthetic environment, an assistive robotics environment with domain transfer, and a natural language processing problem with real datasets demonstrate the superior performance of our querying method over the state-of-the-art information gain method.
- [999] arXiv:2403.06014 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Hard-label based Small Query Black-box Adversarial AttackComments: 11 pages, 3 figuresJournal-ref: IEEE/CVF Winter Conference on Applications of Computer Vision, 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We consider the hard label based black box adversarial attack setting which solely observes predicted classes from the target model. Most of the attack methods in this setting suffer from impractical number of queries required to achieve a successful attack. One approach to tackle this drawback is utilising the adversarial transferability between white box surrogate models and black box target model. However, the majority of the methods adopting this approach are soft label based to take the full advantage of zeroth order optimisation. Unlike mainstream methods, we propose a new practical setting of hard label based attack with an optimisation process guided by a pretrained surrogate model. Experiments show the proposed method significantly improves the query efficiency of the hard label based black-box attack across various target model architectures. We find the proposed method achieves approximately 5 times higher attack success rate compared to the benchmarks, especially at the small query budgets as 100 and 250.
- [1000] arXiv:2403.06018 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Few-Shot Cross-Lingual Transfer for Prompting Large Language Models in Low-Resource LanguagesComments: 47 pages, 26 figures; a thesis submitted in partial satisfaction of the requirements for the degree of Bachelor of Science in Computer Science at the University of California - Santa CruzSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large pre-trained language models (PLMs) are at the forefront of advances in Natural Language Processing. One widespread use case of PLMs is "prompting" - or in-context learning - where a user provides a description of a task and some completed examples of the task to a PLM as context before prompting the PLM to perform the task on a new example. Only the largest, most capable PLMs are able to perform in-context learning effectively, and these models are typically trained with a predominantly English corpus, leaving all other languages behind. The data limitations in most languages preclude the training of language-specific PLMs capable of prompting. Albeit the surge in work of prompting settings, it is still unclear how PLMs should be adapted cross-lingually specifically for prompting. We evaluate the possible methods to adapt LLaMa, a 7B parameter open-source PLM mainly trained in English, for prompting in low-resource languages, namely for Kinyarwanda, Hausa, and Luganda. We consider three methods: few-shot prompting (prompt), language-adaptive fine-tuning (LAFT), and neural machine translation (translate), and evaluate on abstractive summarization, multi-class topic classification, and named-entity recognition. Although LAFT carries the greatest compute cost and intuitively should lead to the best results, our experiments exhibit that LAFT is only occasionally the optimal choice for adapting PLMs for prompting. Rather, the translate and prompt settings are a compute-efficient and cost-effective method of few-shot prompting for the selected low-resource languages. We find that the results are task and language dependent but find that the prompting method is the best on average across all tasks and languages. Results show that the prompt setting performs better than both translating and LAFT with statistical significance for all shots when aggregated across all tasks and languages.
- [1001] arXiv:2403.06025 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CarbonNet: How Computer Vision Plays a Role in Climate Change? Application: Learning Geomechanics from Subsurface Geometry of CCS to Mitigate Global WarmingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We introduce a new approach using computer vision to predict the land surface displacement from subsurface geometry images for Carbon Capture and Sequestration (CCS). CCS has been proved to be a key component for a carbon neutral society. However, scientists see there are challenges along the way including the high computational cost due to the large model scale and limitations to generalize a pre-trained model with complex physics. We tackle those challenges by training models directly from the subsurface geometry images. The goal is to understand the respons of land surface displacement due to carbon injection and utilize our trained models to inform decision making in CCS projects.
We implement multiple models (CNN, ResNet, and ResNetUNet) for static mechanics problem, which is a image prediction problem. Next, we use the LSTM and transformer for transient mechanics scenario, which is a video prediction problem. It shows ResNetUNet outperforms the others thanks to its architecture in static mechanics problem, and LSTM shows comparable performance to transformer in transient problem. This report proceeds by outlining our dataset in detail followed by model descriptions in method section. Result and discussion state the key learning, observations, and conclusion with future work rounds out the paper. - [1002] arXiv:2403.06026 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards a Generic Representation of Combinatorial Problems for Learning-Based ApproachesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, there has been a growing interest in using learning-based approaches for solving combinatorial problems, either in an end-to-end manner or in conjunction with traditional optimization algorithms. In both scenarios, the challenge lies in encoding the targeted combinatorial problems into a structure compatible with the learning algorithm. Many existing works have proposed problem-specific representations, often in the form of a graph, to leverage the advantages of \textit{graph neural networks}. However, these approaches lack generality, as the representation cannot be easily transferred from one combinatorial problem to another one. While some attempts have been made to bridge this gap, they still offer a partial generality only. In response to this challenge, this paper advocates for progress toward a fully generic representation of combinatorial problems for learning-based approaches. The approach we propose involves constructing a graph by breaking down any constraint of a combinatorial problem into an abstract syntax tree and expressing relationships (e.g., a variable involved in a constraint) through the edges. Furthermore, we introduce a graph neural network architecture capable of efficiently learning from this representation. The tool provided operates on combinatorial problems expressed in the XCSP3 format, handling all the constraints available in the 2023 mini-track competition. Experimental results on four combinatorial problems demonstrate that our architecture achieves performance comparable to dedicated architectures while maintaining generality. Our code and trained models are publicly available at \url{ this https URL }.
- [1003] arXiv:2403.06031 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FairTargetSim: An Interactive Simulator for Understanding and Explaining the Fairness Effects of Target Variable DefinitionDalia Gala , Milo Phillips-Brown , Naman Goel , Carinal Prunkl , Laura Alvarez Jubete , medb corcoran , Ray Eitel-PorterSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Machine learning requires defining one's target variable for predictions or decisions, a process that can have profound implications on fairness: biases are often encoded in target variable definition itself, before any data collection or training. We present an interactive simulator, FairTargetSim (FTS), that illustrates how target variable definition impacts fairness. FTS is a valuable tool for algorithm developers, researchers, and non-technical stakeholders. FTS uses a case study of algorithmic hiring, using real-world data and user-defined target variables. FTS is open-source and available at: this http URL . The video accompanying this paper is here: this http URL .
- [1004] arXiv:2403.06039 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: A Preliminary Exploration of YouTubers' Use of Generative-AI in Content CreationComments: Accepted at CHI LBW 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Content creators increasingly utilize generative artificial intelligence (Gen-AI) on platforms such as YouTube, TikTok, Instagram, and various blogging sites to produce imaginative images, AI-generated videos, and articles using Large Language Models (LLMs). Despite its growing popularity, there remains an underexplored area concerning the specific domains where AI-generated content is being applied, and the methodologies content creators employ with Gen-AI tools during the creation process. This study initially explores this emerging area through a qualitative analysis of 68 YouTube videos demonstrating Gen-AI usage. Our research focuses on identifying the content domains, the variety of tools used, the activities performed, and the nature of the final products generated by Gen-AI in the context of user-generated content.
- [1005] arXiv:2403.06041 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: MATRIX: Multi-Agent Trajectory Generation with Diverse ContextsComments: IEEE International Conference on Robotics and Automation (ICRA 2024)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: Data-driven methods have great advantages in modeling complicated human behavioral dynamics and dealing with many human-robot interaction applications. However, collecting massive and annotated real-world human datasets has been a laborious task, especially for highly interactive scenarios. On the other hand, algorithmic data generation methods are usually limited by their model capacities, making them unable to offer realistic and diverse data needed by various application users. In this work, we study trajectory-level data generation for multi-human or human-robot interaction scenarios and propose a learning-based automatic trajectory generation model, which we call Multi-Agent TRajectory generation with dIverse conteXts (MATRIX). MATRIX is capable of generating interactive human behaviors in realistic diverse contexts. We achieve this goal by modeling the explicit and interpretable objectives so that MATRIX can generate human motions based on diverse destinations and heterogeneous behaviors. We carried out extensive comparison and ablation studies to illustrate the effectiveness of our approach across various metrics. We also presented experiments that demonstrate the capability of MATRIX to serve as data augmentation for imitation-based motion planning.
- [1006] arXiv:2403.06054 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Decoupled Data Consistency with Diffusion Purification for Image RestorationSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Diffusion models have recently gained traction as a powerful class of deep generative priors, excelling in a wide range of image restoration tasks due to their exceptional ability to model data distributions. To solve image restoration problems, many existing techniques achieve data consistency by incorporating additional likelihood gradient steps into the reverse sampling process of diffusion models. However, the additional gradient steps pose a challenge for real-world practical applications as they incur a large computational overhead, thereby increasing inference time. They also present additional difficulties when using accelerated diffusion model samplers, as the number of data consistency steps is limited by the number of reverse sampling steps. In this work, we propose a novel diffusion-based image restoration solver that addresses these issues by decoupling the reverse process from the data consistency steps. Our method involves alternating between a reconstruction phase to maintain data consistency and a refinement phase that enforces the prior via diffusion purification. Our approach demonstrates versatility, making it highly adaptable for efficient problem-solving in latent space. Additionally, it reduces the necessity for numerous sampling steps through the integration of consistency models. The efficacy of our approach is validated through comprehensive experiments across various image restoration tasks, including image denoising, deblurring, inpainting, and super-resolution.
- [1007] arXiv:2403.06063 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Target-constrained Bidirectional Planning for Generation of Target-oriented Proactive DialogueComments: Accepted by ACM Transactions on Information Systems (TOIS)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Target-oriented proactive dialogue systems aim to lead conversations from a dialogue context toward a pre-determined target, such as making recommendations on designated items or introducing new specific topics. To this end, it is critical for such dialogue systems to plan reasonable actions to drive the conversation proactively, and meanwhile, to plan appropriate topics to move the conversation forward to the target topic smoothly. In this work, we mainly focus on effective dialogue planning for target-oriented dialogue generation. Inspired by decision-making theories in cognitive science, we propose a novel target-constrained bidirectional planning (TRIP) approach, which plans an appropriate dialogue path by looking ahead and looking back. By formulating the planning as a generation task, our TRIP bidirectionally generates a dialogue path consisting of a sequence of <action, topic> pairs using two Transformer decoders. They are expected to supervise each other and converge on consistent actions and topics by minimizing the decision gap and contrastive generation of targets. Moreover, we propose a target-constrained decoding algorithm with a bidirectional agreement to better control the planning process. Subsequently, we adopt the planned dialogue paths to guide dialogue generation in a pipeline manner, where we explore two variants: prompt-based generation and plan-controlled generation. Extensive experiments are conducted on two challenging dialogue datasets, which are re-purposed for exploring target-oriented dialogue. Our automatic and human evaluations demonstrate that the proposed methods significantly outperform various baseline models.
- [1008] arXiv:2403.06064 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: L$^2$GC: Lorentzian Linear Graph Convolutional Networks For Node ClassificationComments: Accepted by LREC-COLING 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Linear Graph Convolutional Networks (GCNs) are used to classify the node in the graph data. However, we note that most existing linear GCN models perform neural network operations in Euclidean space, which do not explicitly capture the tree-like hierarchical structure exhibited in real-world datasets that modeled as graphs. In this paper, we attempt to introduce hyperbolic space into linear GCN and propose a novel framework for Lorentzian linear GCN. Specifically, we map the learned features of graph nodes into hyperbolic space, and then perform a Lorentzian linear feature transformation to capture the underlying tree-like structure of data. Experimental results on standard citation networks datasets with semi-supervised learning show that our approach yields new state-of-the-art results of accuracy 74.7$\%$ on Citeseer and 81.3$\%$ on PubMed datasets. Furthermore, we observe that our approach can be trained up to two orders of magnitude faster than other nonlinear GCN models on PubMed dataset. Our code is publicly available at this https URL .
- [1009] arXiv:2403.06088 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Towards In-Vehicle Multi-Task Facial Attribute Recognition: Investigating Synthetic Data and Vision Foundation ModelsComments: Manuscript under peer reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: In the burgeoning field of intelligent transportation systems, enhancing vehicle-driver interaction through facial attribute recognition, such as facial expression, eye gaze, age, etc., is of paramount importance for safety, personalization, and overall user experience. However, the scarcity of comprehensive large-scale, real-world datasets poses a significant challenge for training robust multi-task models. Existing literature often overlooks the potential of synthetic datasets and the comparative efficacy of state-of-the-art vision foundation models in such constrained settings. This paper addresses these gaps by investigating the utility of synthetic datasets for training complex multi-task models that recognize facial attributes of passengers of a vehicle, such as gaze plane, age, and facial expression. Utilizing transfer learning techniques with both pre-trained Vision Transformer (ViT) and Residual Network (ResNet) models, we explore various training and adaptation methods to optimize performance, particularly when data availability is limited. We provide extensive post-evaluation analysis, investigating the effects of synthetic data distributions on model performance in in-distribution data and out-of-distribution inference. Our study unveils counter-intuitive findings, notably the superior performance of ResNet over ViTs in our specific multi-task context, which is attributed to the mismatch in model complexity relative to task complexity. Our results highlight the challenges and opportunities for enhancing the use of synthetic data and vision foundation models in practical applications.
- [1010] arXiv:2403.06095 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code CompletionComments: Under ReviewSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present \tool, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to \tool is the {\em Repo-level Semantic Graph} (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages \textit{Expand and Refine} retrieval method, including a graph expansion and a link prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that \tool markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at~\url{ this https URL }.
- [1011] arXiv:2403.06097 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can LLM Substitute Human Labeling? A Case Study of Fine-grained Chinese Address Entity Recognition Dataset for UAV DeliveryComments: Accepted by TheWebConf'24 (WWW'24) as a Resource PaperSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: We present CNER-UAV, a fine-grained \textbf{C}hinese \textbf{N}ame \textbf{E}ntity \textbf{R}ecognition dataset specifically designed for the task of address resolution in \textbf{U}nmanned \textbf{A}erial \textbf{V}ehicle delivery systems. The dataset encompasses a diverse range of five categories, enabling comprehensive training and evaluation of NER models. To construct this dataset, we sourced the data from a real-world UAV delivery system and conducted a rigorous data cleaning and desensitization process to ensure privacy and data integrity. The resulting dataset, consisting of around 12,000 annotated samples, underwent human experts and \textbf{L}arge \textbf{L}anguage \textbf{M}odel annotation. We evaluated classical NER models on our dataset and provided in-depth analysis. The dataset and models are publicly available at \url{ this https URL }.
- [1012] arXiv:2403.06108 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large Language Models on Fine-grained Emotion Detection Dataset with Data Augmentation and Transfer LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper delves into enhancing the classification performance on the GoEmotions dataset, a large, manually annotated dataset for emotion detection in text. The primary goal of this paper is to address the challenges of detecting subtle emotions in text, a complex issue in Natural Language Processing (NLP) with significant practical applications. The findings offer valuable insights into addressing the challenges of emotion detection in text and suggest directions for future research, including the potential for a survey paper that synthesizes methods and performances across various datasets in this domain.
- [1013] arXiv:2403.06115 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FMPAF: How Do Fed Chairs Affect the Financial Market? A Fine-grained Monetary Policy Analysis Framework on Their LanguageComments: accepted by AAAI 2024 Workshop: AI in Finance for Social ImpactSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract: The effectiveness of central bank communication is a crucial aspect of monetary policy transmission. While recent research has examined the influence of policy communication by the chairs of the Federal Reserve on various financial variables, much of the literature relies on rule-based or dictionary-based methods in parsing the language of the chairs, leaving nuanced information about policy stance contained in nonverbal emotion out of the analysis. In the current study, we propose the Fine-Grained Monetary Policy Analysis Framework (FMPAF), a novel approach that integrates large language models (LLMs) with regression analysis to provide a comprehensive analysis of the impact of the press-conference communications of chairs of the Federal Reserve on financial markets. We conduct extensive comparisons of model performance under different levels of granularity, modalities, and communication scenarios. Based on our preferred specification, a one-unit increase in the sentiment score is associated with an increase of the price of S\&P 500 Exchange-Traded Fund by approximately 500 basis points, a 15-basis-point decrease in the policy interest rate, while not leading to a significant response in exchange rates.
- [1014] arXiv:2403.06131 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: FedPIT: Towards Privacy-preserving and Few-shot Federated Instruction TuningComments: Work in processSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Instruction tuning has proven essential for enhancing the performance of large language models (LLMs) in generating human-aligned responses. However, collecting diverse, high-quality instruction data for tuning poses challenges, particularly in privacy-sensitive domains. Federated instruction tuning (FedIT) has emerged as a solution, leveraging federated learning from multiple data owners while preserving privacy. Yet, it faces challenges due to limited instruction data and vulnerabilities to training data extraction attacks. To address these issues, we propose a novel federated algorithm, FedPIT, which utilizes LLMs' in-context learning capability to self-generate task-specific synthetic data for training autonomously. Our method employs parameter-isolated training to maintain global parameters trained on synthetic data and local parameters trained on augmented local data, effectively thwarting data extraction attacks. Extensive experiments on real-world medical data demonstrate the effectiveness of FedPIT in improving federated few-shot performance while preserving privacy and robustness against data heterogeneity.
- [1015] arXiv:2403.06135 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: MACE: Mass Concept Erasure in Diffusion ModelsComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at this https URL .
- [1016] arXiv:2403.06139 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Fine-grainedly Synthesize Streaming Data Based On Large Language Models With Graph Structure Understanding For Data SparsitySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Due to the sparsity of user data, sentiment analysis on user reviews in e-commerce platforms often suffers from poor performance, especially when faced with extremely sparse user data or long-tail labels. Recently, the emergence of LLMs has introduced new solutions to such problems by leveraging graph structures to generate supplementary user profiles. However, previous approaches have not fully utilized the graph understanding capabilities of LLMs and have struggled to adapt to complex streaming data environments. In this work, we propose a fine-grained streaming data synthesis framework that categorizes sparse users into three categories: Mid-tail, Long-tail, and Extreme. Specifically, we design LLMs to comprehensively understand three key graph elements in streaming data, including Local-global Graph Understanding, Second-Order Relationship Extraction, and Product Attribute Understanding, which enables the generation of high-quality synthetic data to effectively address sparsity across different categories. Experimental results on three real datasets demonstrate significant performance improvements, with synthesized data contributing to MSE reductions of 45.85%, 3.16%, and 62.21%, respectively.
- [1017] arXiv:2403.06143 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Fluent: Round-efficient Secure Aggregation for Private Federated LearningSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning (FL) facilitates collaborative training of machine learning models among a large number of clients while safeguarding the privacy of their local datasets. However, FL remains susceptible to vulnerabilities such as privacy inference and inversion attacks. Single-server secure aggregation schemes were proposed to address these threats. Nonetheless, they encounter practical constraints due to their round and communication complexities. This work introduces Fluent, a round and communication-efficient secure aggregation scheme for private FL. Fluent has several improvements compared to state-of-the-art solutions like Bell et al. (CCS 2020) and Ma et al. (SP 2023): (1) it eliminates frequent handshakes and secret sharing operations by efficiently reusing the shares across multiple training iterations without leaking any private information; (2) it accomplishes both the consistency check and gradient unmasking in one logical step, thereby reducing another round of communication. With these innovations, Fluent achieves the fewest communication rounds (i.e., two in the collection phase) in the malicious server setting, in contrast to at least three rounds in existing schemes. This significantly minimizes the latency for geographically distributed clients; (3) Fluent also introduces Fluent-Dynamic with a participant selection algorithm and an alternative secret sharing scheme. This can facilitate dynamic client joining and enhance the system flexibility and scalability. We implemented Fluent and compared it with existing solutions. Experimental results show that Fluent improves the computational cost by at least 75% and communication overhead by at least 25% for normal clients. Fluent also reduces the communication overhead for the server at the expense of a marginal increase in computational cost.
- [1018] arXiv:2403.06145 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: All-in-one platform for AI R&D in medical imaging, encompassing data collection, selection, annotation, and pre-processingChanghee Han , Kyohei Shibano , Wataru Ozaki , Keishiro Osaki , Takafumi Haraguchi , Daisuke Hirahara , Shumon Kimura , Yasuyuki Kobayashi , Gento MogiComments: 5 pages, 3 figures, accepted to SPIE Medical Imaging 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deep Learning is advancing medical imaging Research and Development (R&D), leading to the frequent clinical use of Artificial Intelligence/Machine Learning (AI/ML)-based medical devices. However, to advance AI R&D, two challenges arise: 1) significant data imbalance, with most data from Europe/America and under 10% from Asia, despite its 60% global population share; and 2) hefty time and investment needed to curate proprietary datasets for commercial use. In response, we established the first commercial medical imaging platform, encompassing steps like: 1) data collection, 2) data selection, 3) annotation, and 4) pre-processing. Moreover, we focus on harnessing under-represented data from Japan and broader Asia, including Computed Tomography, Magnetic Resonance Imaging, and Whole Slide Imaging scans. Using the collected data, we are preparing/providing ready-to-use datasets for medical AI R&D by 1) offering these datasets to AI firms, biopharma, and medical device makers and 2) using them as training/test data to develop tailored AI solutions for such entities. We also aim to merge Blockchain for data security and plan to synthesize rare disease data via generative AI. DataHub Website: this https URL
- [1019] arXiv:2403.06149 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Can Large Language Models Automatically Score Proficiency of Written Essays?Comments: V2 (published version of LREC-COLING 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Although several methods were proposed to address the problem of automated essay scoring (AES) in the last 50 years, there is still much to desire in terms of effectiveness. Large Language Models (LLMs) are transformer-based models that demonstrate extraordinary capabilities on various tasks. In this paper, we test the ability of LLMs, given their powerful linguistic knowledge, to analyze and effectively score written essays. We experimented with two popular LLMs, namely ChatGPT and Llama. We aim to check if these models can do this task and, if so, how their performance is positioned among the state-of-the-art (SOTA) models across two levels, holistically and per individual writing trait. We utilized prompt-engineering tactics in designing four different prompts to bring their maximum potential to this task. Our experiments conducted on the ASAP dataset revealed several interesting observations. First, choosing the right prompt depends highly on the model and nature of the task. Second, the two LLMs exhibited comparable average performance in AES, with a slight advantage for ChatGPT. Finally, despite the performance gap between the two LLMs and SOTA models in terms of predictions, they provide feedback to enhance the quality of the essays, which can potentially help both teachers and students.
- [1020] arXiv:2403.06168 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DiffuMatting: Synthesizing Arbitrary Objects with Matting-level AnnotationXiaobin Hu , Xu Peng , Donghao Luo , Xiaozhong Ji , Jinlong Peng , Zhengkai Jiang , Jiangning Zhang , Taisong Jin , Chengjie Wang , Rongrong JiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Due to the difficulty and labor-consuming nature of getting highly accurate or matting annotations, there only exists a limited amount of highly accurate labels available to the public. To tackle this challenge, we propose a DiffuMatting which inherits the strong Everything generation ability of diffusion and endows the power of "matting anything". Our DiffuMatting can 1). act as an anything matting factory with high accurate annotations 2). be well-compatible with community LoRAs or various conditional control approaches to achieve the community-friendly art design and controllable generation. Specifically, inspired by green-screen-matting, we aim to teach the diffusion model to paint on a fixed green screen canvas. To this end, a large-scale greenscreen dataset (Green100K) is collected as a training dataset for DiffuMatting. Secondly, a green background control loss is proposed to keep the drawing board as a pure green color to distinguish the foreground and background. To ensure the synthesized object has more edge details, a detailed-enhancement of transition boundary loss is proposed as a guideline to generate objects with more complicated edge structures. Aiming to simultaneously generate the object and its matting annotation, we build a matting head to make a green color removal in the latent space of the VAE decoder. Our DiffuMatting shows several potential applications (e.g., matting-data generator, community-friendly art design and controllable generation). As a matting-data generator, DiffuMatting synthesizes general object and portrait matting sets, effectively reducing the relative MSE error by 15.4% in General Object Matting and 11.4% in Portrait Matting tasks.
- [1021] arXiv:2403.06174 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Domain Adversarial Active Learning for Domain Generalization ClassificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Domain generalization models aim to learn cross-domain knowledge from source domain data, to improve performance on unknown target domains. Recent research has demonstrated that diverse and rich source domain samples can enhance domain generalization capability. This paper argues that the impact of each sample on the model's generalization ability varies. Despite its small scale, a high-quality dataset can still attain a certain level of generalization ability. Motivated by this, we propose a domain-adversarial active learning (DAAL) algorithm for classification tasks in domain generalization. First, we analyze that the objective of tasks is to maximize the inter-class distance within the same domain and minimize the intra-class distance across different domains. To achieve this objective, we design a domain adversarial selection method that prioritizes challenging samples. Second, we posit that even in a converged model, there are subsets of features that lack discriminatory power within each domain. We attempt to identify these feature subsets and optimize them by a constraint loss. We validate and analyze our DAAL algorithm on multiple domain generalization datasets, comparing it with various domain generalization algorithms and active learning algorithms. Our results demonstrate that the DAAL algorithm can achieve strong generalization ability with fewer data resources, thereby reducing data annotation costs in domain generalization tasks.
- [1022] arXiv:2403.06201 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Are You Being Tracked? Discover the Power of Zero-Shot Trajectory Tracing with LLMs!Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: There is a burgeoning discussion around the capabilities of Large Language Models (LLMs) in acting as fundamental components that can be seamlessly incorporated into Artificial Intelligence of Things (AIoT) to interpret complex trajectories. This study introduces LLMTrack, a model that illustrates how LLMs can be leveraged for Zero-Shot Trajectory Recognition by employing a novel single-prompt technique that combines role-play and think step-by-step methodologies with unprocessed Inertial Measurement Unit (IMU) data. We evaluate the model using real-world datasets designed to challenge it with distinct trajectories characterized by indoor and outdoor scenarios. In both test scenarios, LLMTrack not only meets but exceeds the performance benchmarks set by traditional machine learning approaches and even contemporary state-of-the-art deep learning models, all without the requirement of training on specialized datasets. The results of our research suggest that, with strategically designed prompts, LLMs can tap into their extensive knowledge base and are well-equipped to analyze raw sensor data with remarkable effectiveness.
- [1023] arXiv:2403.06206 (cross-list from cs.IT) [ pdf , ps , html , other ]
-
Title: Limit of the Maximum Random Permutation Set EntropyComments: 22 pages, 5 figuresSubjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI)
Abstract: The Random Permutation Set (RPS) is a new type of set proposed recently, which can be regarded as the generalization of evidence theory. To measure the uncertainty of RPS, the entropy of RPS and its corresponding maximum entropy have been proposed. Exploring the maximum entropy provides a possible way of understanding the physical meaning of RPS. In this paper, a new concept, the envelope of entropy function, is defined. In addition, the limit of the envelope of RPS entropy is derived and proved. Compared with the existing method, the computational complexity of the proposed method to calculate the envelope of RPS entropy decreases greatly. The result shows that when $N \to \infty$, the limit form of the envelope of the entropy of RPS converges to $e \times (N!)^2$, which is highly connected to the constant $e$ and factorial. Finally, numerical examples validate the efficiency and conciseness of the proposed envelope, which provides a new insight into the maximum entropy function.
- [1024] arXiv:2403.06213 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: $V_kD:$ Improving Knowledge Distillation using Orthogonal ProjectionsComments: CVPR 2024. Code available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: this https URL
- [1025] arXiv:2403.06225 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MoST: Motion Style Transformer between Diverse Action ContentsComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: While existing motion style transfer methods are effective between two motions with identical content, their performance significantly diminishes when transferring style between motions with different contents. This challenge lies in the lack of clear separation between content and style of a motion. To tackle this challenge, we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion. Our distinctive approach to achieving the goal of disentanglement is twofold: (1) a new architecture for motion style transformer with `part-attentive style modulator across body parts' and `Siamese encoders that encode style and content features separately'; (2) style disentanglement loss. Our method outperforms existing methods and demonstrates exceptionally high quality, particularly in motion pairs with different contents, without the need for heuristic post-processing. Codes are available at this https URL .
- [1026] arXiv:2403.06235 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Probabilistic Neural CircuitsComments: Proceedings of the AAAI Conference on Artificial IntelligenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Abstract: Probabilistic circuits (PCs) have gained prominence in recent years as a versatile framework for discussing probabilistic models that support tractable queries and are yet expressive enough to model complex probability distributions. Nevertheless, tractability comes at a cost: PCs are less expressive than neural networks. In this paper we introduce probabilistic neural circuits (PNCs), which strike a balance between PCs and neural nets in terms of tractability and expressive power. Theoretically, we show that PNCs can be interpreted as deep mixtures of Bayesian networks. Experimentally, we demonstrate that PNCs constitute powerful function approximators.
- [1027] arXiv:2403.06239 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cooperative Classification and Rationalization for Graph GeneralizationComments: Accepted to WWW 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph Neural Networks (GNNs) have achieved impressive results in graph classification tasks, but they struggle to generalize effectively when faced with out-of-distribution (OOD) data. Several approaches have been proposed to address this problem. Among them, one solution is to diversify training distributions in vanilla classification by modifying the data environment, yet accessing the environment information is complex. Besides, another promising approach involves rationalization, extracting invariant rationales for predictions. However, extracting rationales is difficult due to limited learning signals, resulting in less accurate rationales and diminished predictions. To address these challenges, in this paper, we propose a Cooperative Classification and Rationalization (C2R) method, consisting of the classification and the rationalization module. Specifically, we first assume that multiple environments are available in the classification module. Then, we introduce diverse training distributions using an environment-conditional generative network, enabling robust graph representations. Meanwhile, the rationalization module employs a separator to identify relevant rationale subgraphs while the remaining non-rationale subgraphs are de-correlated with labels. Next, we align graph representations from the classification module with rationale subgraph representations using the knowledge distillation methods, enhancing the learning signal for rationales. Finally, we infer multiple environments by gathering non-rationale representations and incorporate them into the classification module for cooperative learning. Extensive experimental results on both benchmarks and synthetic datasets demonstrate the effectiveness of C2R. Code is available at this https URL .
- [1028] arXiv:2403.06247 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Text-Guided Variational Image Generation for Industrial Anomaly Detection and SegmentationComments: 18 pages, Accepted to CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.
- [1029] arXiv:2403.06259 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Editing Conceptual Knowledge for Large Language ModelsXiaohan Wang , Shengyu Mao , Ningyu Zhang , Shumin Deng , Yunzhi Yao , Yue Shen , Lei Liang , Jinjie Gu , Huajun ChenSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Recently, there has been a growing interest in knowledge editing for Large Language Models (LLMs). Current approaches and evaluations merely explore the instance-level editing, while whether LLMs possess the capability to modify concepts remains unclear. This paper pioneers the investigation of editing conceptual knowledge for LLMs, by constructing a novel benchmark dataset ConceptEdit and establishing a suite of new metrics for evaluation. The experimental results reveal that, although existing editing methods can efficiently modify concept-level definition to some extent, they also have the potential to distort the related instantial knowledge in LLMs, leading to poor performance. We anticipate this can inspire further progress in better understanding LLMs. Our project homepage is available at this https URL .
- [1030] arXiv:2403.06265 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model PerformanceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be viewed as 0-gram language modeling where equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for downstream success of pre-trained language models. We control the compression ability of several BPE tokenizers by varying the amount of documents available during their training: from 1 million documents to a character-based tokenizer equivalent to no training data at all. We then pre-train English language models based on those tokenizers and fine-tune them over several tasks. We show that there is a correlation between tokenizers' compression and models' downstream performance, suggesting that compression is a reliable intrinsic indicator of tokenization quality. These correlations are more pronounced for generation tasks (over classification) or for smaller models (over large ones). We replicated a representative part of our experiments on Turkish and found similar results, confirming that our results hold for languages with typological characteristics dissimilar to English. We conclude that building better compressing tokenizers is a fruitful avenue for further research and for improving overall model performance.
- [1031] arXiv:2403.06267 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: FARPLS: A Feature-Augmented Robot Trajectory Preference Labeling System to Assist Human Labelers' Preference ElicitationHanfang Lyu , Yuanchen Bai , Xin Liang , Ujaan Das , Chuhan Shi , Leiliang Gong , Yingchi Li , Mingfei Sun , Ming Ge , Xiaojuan MaComments: Accepted to ACM Conference on Intelligent User Interfaces (IUI) 2024, March 18-21, 2024, Greenville, SC, USASubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Preference-based learning aims to align robot task objectives with human values. One of the most common methods to infer human preferences is by pairwise comparisons of robot task trajectories. Traditional comparison-based preference labeling systems seldom support labelers to digest and identify critical differences between complex trajectories recorded in videos. Our formative study (N = 12) suggests that individuals may overlook non-salient task features and establish biased preference criteria during their preference elicitation process because of partial observations. In addition, they may experience mental fatigue when given many pairs to compare, causing their label quality to deteriorate. To mitigate these issues, we propose FARPLS, a Feature-Augmented Robot trajectory Preference Labeling System. FARPLS highlights potential outliers in a wide variety of task features that matter to humans and extracts the corresponding video keyframes for easy review and comparison. It also dynamically adjusts the labeling order according to users' familiarities, difficulties of the trajectory pair, and level of disagreements. At the same time, the system monitors labelers' consistency and provides feedback on labeling progress to keep labelers engaged. A between-subjects study (N = 42, 105 pairs of robot pick-and-place trajectories per person) shows that FARPLS can help users establish preference criteria more easily and notice more relevant details in the presented trajectories than the conventional interface. FARPLS also improves labeling consistency and engagement, mitigating challenges in preference elicitation without raising cognitive loads significantly
- [1032] arXiv:2403.06268 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Physics-Guided Abnormal Trajectory Gap DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Databases (cs.DB); Machine Learning (cs.LG)
Abstract: Given trajectories with gaps (i.e., missing data), we investigate algorithms to identify abnormal gaps in trajectories which occur when a given moving object did not report its location, but other moving objects in the same geographic region periodically did. The problem is important due to its societal applications, such as improving maritime safety and regulatory enforcement for global security concerns such as illegal fishing, illegal oil transfers, and trans-shipments. The problem is challenging due to the difficulty of bounding the possible locations of the moving object during a trajectory gap, and the very high computational cost of detecting gaps in such a large volume of location data. The current literature on anomalous trajectory detection assumes linear interpolation within gaps, which may not be able to detect abnormal gaps since objects within a given region may have traveled away from their shortest path. In preliminary work, we introduced an abnormal gap measure that uses a classical space-time prism model to bound an object's possible movement during the trajectory gap and provided a scalable memoized gap detection algorithm (Memo-AGD). In this paper, we propose a Space Time-Aware Gap Detection (STAGD) approach to leverage space-time indexing and merging of trajectory gaps. We also incorporate a Dynamic Region Merge-based (DRM) approach to efficiently compute gap abnormality scores. We provide theoretical proofs that both algorithms are correct and complete and also provide analysis of asymptotic time complexity. Experimental results on synthetic and real-world maritime trajectory data show that the proposed approach substantially improves computation time over the baseline technique.
- [1033] arXiv:2403.06275 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: UNICORN: Ultrasound Nakagami Imaging via Score Matching and AdaptationComments: 12 pages, 5 figureSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Abstract: Nakagami imaging holds promise for visualizing and quantifying tissue scattering in ultrasound waves, with potential applications in tumor diagnosis and fat fraction estimation which are challenging to discern by conventional ultrasound B-mode images. Existing methods struggle with optimal window size selection and suffer from estimator instability, leading to degraded resolution images. To address this, here we propose a novel method called UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), that offers an accurate, closed-form estimator for Nakagami parameter estimation in terms of the score function of ultrasonic envelope. Extensive experiments using simulation and real ultrasound RF data demonstrate UNICORN's superiority over conventional approaches in accuracy and resolution quality.
- [1034] arXiv:2403.06289 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive LearningZijun Long , Lipeng Zhuang , George Killick , Richard McCreadie , Gerardo Aragon Camarasa , Paul HendersonComments: arXiv admin note: substantial text overlap with arXiv:2311.16481Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Human-annotated vision datasets inevitably contain a fraction of human mislabelled examples. While the detrimental effects of such mislabelling on supervised learning are well-researched, their influence on Supervised Contrastive Learning (SCL) remains largely unexplored. In this paper, we show that human-labelling errors not only differ significantly from synthetic label errors, but also pose unique challenges in SCL, different to those in traditional supervised learning methods. Specifically, our results indicate they adversely impact the learning process in the ~99% of cases when they occur as false positive samples. Existing noise-mitigating methods primarily focus on synthetic label errors and tackle the unrealistic setting of very high synthetic noise rates (40-80%), but they often underperform on common image datasets due to overfitting. To address this issue, we introduce a novel SCL objective with robustness to human-labelling errors, SCL-RHE. SCL-RHE is designed to mitigate the effects of real-world mislabelled examples, typically characterized by much lower noise rates (<5%). We demonstrate that SCL-RHE consistently outperforms state-of-the-art representation learning and noise-mitigating methods across various vision benchmarks, by offering improved resilience against human-labelling errors.
- [1035] arXiv:2403.06313 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Optimal Policy Sparsification and Low Rank Decomposition for Deep Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep reinforcement learning(DRL) has shown significant promise in a wide range of applications including computer games and robotics. Yet, training DRL policies consume extraordinary computing resources resulting in dense policies which are prone to overfitting. Moreover, inference with dense DRL policies limit their practical applications, especially in edge computing. Techniques such as pruning and singular value decomposition have been used with deep learning models to achieve sparsification and model compression to limit overfitting and reduce memory consumption. However, these techniques resulted in sub-optimal performance with notable decay in rewards. $L_1$ and $L_2$ regularization techniques have been proposed for neural network sparsification and sparse auto-encoder development, but their implementation in DRL environments has not been apparent. We propose a novel $L_0$-norm-regularization technique using an optimal sparsity map to sparsify DRL policies and promote their decomposition to a lower rank without decay in rewards. We evaluated our $L_0$-norm-regularization technique across five different environments (Cartpole-v1, Acrobat-v1, LunarLander-v2, SuperMarioBros-7.1.v0 and Surgical Robot Learning) using several on-policy and off-policy algorithms. We demonstrated that the $L_0$-norm-regularized DRL policy in the SuperMarioBros environment achieved 93% sparsity and gained 70% compression when subjected to low-rank decomposition, while significantly outperforming the dense policy. Additionally, the $L_0$-norm-regularized DRL policy in the Surgical Robot Learning environment achieved a 36% sparsification and gained 46% compression when decomposed to a lower rank, while being performant. The results suggest that our custom $L_0$-norm-regularization technique for sparsification of DRL policies is a promising avenue to reduce computational resources and limit overfitting.
- [1036] arXiv:2403.06317 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: An End-to-End Deep Learning Generative Framework for Refinable Shape Matching and GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generative modelling for shapes is a prerequisite for In-Silico Clinical Trials (ISCTs), which aim to cost-effectively validate medical device interventions using synthetic anatomical shapes, often represented as 3D surface meshes. However, constructing AI models to generate shapes closely resembling the real mesh samples is challenging due to variable vertex counts, connectivities, and the lack of dense vertex-wise correspondences across the training data. Employing graph representations for meshes, we develop a novel unsupervised geometric deep-learning model to establish refinable shape correspondences in a latent space, construct a population-derived atlas and generate realistic synthetic shapes. We additionally extend our proposed base model to a joint shape generative-clustering multi-atlas framework to incorporate further variability and preserve more details in the generated shapes. Experimental results using liver and left-ventricular models demonstrate the approach's applicability to computational medicine, highlighting its suitability for ISCTs through a comparative analysis.
- [1037] arXiv:2403.06322 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Leveraging Computer Vision in the Intensive Care Unit (ICU) for Examining Visitation and MobilityScott Siegel , Jiaqing Zhang , Sabyasachi Bandyopadhyay , Subhash Nerella , Brandon Silva , Tezcan Baslanti , Azra Bihorac , Parisa RashidiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Despite the importance of closely monitoring patients in the Intensive Care Unit (ICU), many aspects are still assessed in a limited manner due to the time constraints imposed on healthcare providers. For example, although excessive visitations during rest hours can potentially exacerbate the risk of circadian rhythm disruption and delirium, it is not captured in the ICU. Likewise, while mobility can be an important indicator of recovery or deterioration in ICU patients, it is only captured sporadically or not captured at all. In the past few years, the computer vision field has found application in many domains by reducing the human burden. Using computer vision systems in the ICU can also potentially enable non-existing assessments or enhance the frequency and accuracy of existing assessments while reducing the staff workload. In this study, we leverage a state-of-the-art noninvasive computer vision system based on depth imaging to characterize ICU visitations and patients' mobility. We then examine the relationship between visitation and several patient outcomes, such as pain, acuity, and delirium. We found an association between deteriorating patient acuity and the incidence of delirium with increased visitations. In contrast, self-reported pain, reported using the Defense and Veteran Pain Rating Scale (DVPRS), was correlated with decreased visitations. Our findings highlight the feasibility and potential of using noninvasive autonomous systems to monitor ICU patients.
- [1038] arXiv:2403.06326 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From Instructions to Constraints: Language Model Alignment with Automatic Constraint VerificationFei Wang , Chao Shang , Sarthak Jain , Shuai Wang , Qiang Ning , Bonan Min , Vittorio Castelli , Yassine Benajiba , Dan RothSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collect preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
- [1039] arXiv:2403.06332 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Exploiting the Margin: How Capitalism Fuels AI at the Expense of Minoritized GroupsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores the intricate relationship between capitalism, racial injustice, and artificial intelligence (AI), arguing that AI acts as a contemporary vehicle for age-old forms of exploitation. By linking historical patterns of racial and economic oppression with current AI practices, this study illustrates how modern technology perpetuates and deepens societal inequalities. It specifically examines how AI is implicated in the exploitation of marginalized communities through underpaid labor in the gig economy, the perpetuation of biases in algorithmic decision-making, and the reinforcement of systemic barriers that prevent these groups from benefiting equitably from technological advances. Furthermore, the paper discusses the role of AI in extending and intensifying the social, economic, and psychological burdens faced by these communities, highlighting the problematic use of AI in surveillance, law enforcement, and mental health contexts. The analysis concludes with a call for transformative changes in how AI is developed and deployed. Advocating for a reevaluation of the values driving AI innovation, the paper promotes an approach that integrates social justice and equity into the core of technological design and policy. This shift is crucial for ensuring that AI serves as a tool for societal improvement, fostering empowerment and healing rather than deepening existing divides.
- [1040] arXiv:2403.06349 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MOAB: Multi-Modal Outer Arithmetic Block For Fusion Of Histopathological Images And Genetic Data For Brain Tumor GradingOmnia Alwazzan (1 and 2), Abbas Khan (1 and 2), Ioannis Patras (1 and 2), Gregory Slabaugh (1 and 2) ((1) School of Electronic Engineering and Computer Science, Queen Mary University of London, UK, (2) Queen Mary Digital Environment Research Institute (DERI), London, UK)Journal-ref: pages={1--5},year={2023},organization={IEEE}Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Brain tumors are an abnormal growth of cells in the brain. They can be classified into distinct grades based on their growth. Often grading is performed based on a histological image and is one of the most significant predictors of a patients prognosis, the higher the grade, the more aggressive the tumor. Correct diagnosis of a tumor grade remains challenging. Though histopathological grading has been shown to be prognostic, results are subject to interobserver variability, even among experienced pathologists. Recently, the World Health Organization reported that advances in molecular genetics have led to improvements in tumor classification. This paper seeks to integrate histological images and genetic data for improved computer-aided diagnosis. We propose a novel Multi-modal Outer Arithmetic Block (MOAB) based on arithmetic operations to combine latent representations of the different modalities for predicting the tumor grade (Grade \rom{2}, \rom{3} and \rom{4}). Extensive experiments evaluate the effectiveness of our approach. By applying MOAB to The Cancer Genome Atlas (TCGA) glioma dataset, we show that it can improve separation between similar classes (Grade \rom{2} and \rom{3}) and outperform prior state-of-the-art grade classification techniques.
- [1041] arXiv:2403.06356 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Video Generation with Consistency TuningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Currently, various studies have been exploring generation of long videos. However, the generated frames in these videos often exhibit jitter and noise. Therefore, in order to generate the videos without these noise, we propose a novel framework composed of four modules: separate tuning module, average fusion module, combined tuning module, and inter-frame consistency module. By applying our newly proposed modules subsequently, the consistency of the background and foreground in each video frames is optimized. Besides, the experimental results demonstrate that videos generated by our method exhibit a high quality in comparison of the state-of-the-art methods.
- [1042] arXiv:2403.06360 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Human and Automatic Interpretation of Romanian Noun CompoundsComments: 6 pages, 2 figures, 3 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Determining the intended, context-dependent meanings of noun compounds like "shoe sale" and "fire sale" remains a challenge for NLP. Previous work has relied on inventories of semantic relations that capture the different meanings between compound members. Focusing on Romanian compounds, whose morphosyntax differs from that of their English counterparts, we propose a new set of relations and test it with human annotators and a neural net classifier. Results show an alignment of the network's predictions and human judgments, even where the human agreement rate is low. Agreement tracks with the frequency of the selected relations, regardless of structural differences. However, the most frequently selected relation was none of the sixteen labeled semantic relations, indicating the need for a better relation inventory.
- [1043] arXiv:2403.06382 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Pre-Trained Model Recommendation for Downstream Fine-tuningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: As a fundamental problem in transfer learning, model selection aims to rank off-the-shelf pre-trained models and select the most suitable one for the new target task. Existing model selection techniques are often constrained in their scope and tend to overlook the nuanced relationships between models and tasks. In this paper, we present a pragmatic framework \textbf{Fennec}, delving into a diverse, large-scale model repository while meticulously considering the intricate connections between tasks and models. The key insight is to map all models and historical tasks into a transfer-related subspace, where the distance between model vectors and task vectors represents the magnitude of transferability. A large vision model, as a proxy, infers a new task's representation in the transfer space, thereby circumventing the computational burden of extensive forward passes. We also investigate the impact of the inherent inductive bias of models on transfer results and propose a novel method called \textbf{archi2vec} to encode the intricate structures of models. The transfer score is computed through straightforward vector arithmetic with a time complexity of $\mathcal{O}(1)$. Finally, we make a substantial contribution to the field by releasing a comprehensive benchmark. We validate the effectiveness of our framework through rigorous testing on two benchmarks. The benchmark and the code will be publicly available in the near future.
- [1044] arXiv:2403.06397 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DeepSafeMPC: Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement LearningComments: 8 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Safe Multi-agent reinforcement learning (safe MARL) has increasingly gained attention in recent years, emphasizing the need for agents to not only optimize the global return but also adhere to safety requirements through behavioral constraints. Some recent work has integrated control theory with multi-agent reinforcement learning to address the challenge of ensuring safety. However, there have been only very limited applications of Model Predictive Control (MPC) methods in this domain, primarily due to the complex and implicit dynamics characteristic of multi-agent environments. To bridge this gap, we propose a novel method called Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning (DeepSafeMPC). The key insight of DeepSafeMPC is leveraging a entralized deep learning model to well predict environmental dynamics. Our method applies MARL principles to search for optimal solutions. Through the employment of MPC, the actions of agents can be restricted within safe states concurrently. We demonstrate the effectiveness of our approach using the Safe Multi-agent MuJoCo environment, showcasing significant advancements in addressing safety concerns in MARL.
- [1045] arXiv:2403.06398 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Diminishing Returns of Width for Continual LearningComments: 10 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from \emph{catastrophic forgetting} when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting but have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks to analyze Continual Learning Theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies where the diminishing returns are clearly observed as predicted by our theory.
- [1046] arXiv:2403.06408 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of PerturbationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.
- [1047] arXiv:2403.06410 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Logical Pattern Memory Pre-trained Model for Entailment Tree GenerationComments: Accepted By Coling 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Generating coherent and credible explanations remains a significant challenge in the field of AI. In recent years, researchers have delved into the utilization of entailment trees to depict explanations, which exhibit a reasoning process of how a hypothesis is deduced from the supporting facts. However, existing models often overlook the importance of generating intermediate conclusions with logical consistency from the given facts, leading to inaccurate conclusions and undermining the overall credibility of entailment trees. To address this limitation, we propose the logical pattern memory pre-trained model (LMPM). LMPM incorporates an external memory structure to learn and store the latent representations of logical patterns, which aids in generating logically consistent conclusions. Furthermore, to mitigate the influence of logically irrelevant domain knowledge in the Wikipedia-based data, we introduce an entity abstraction approach to construct the dataset for pre-training LMPM. The experimental results highlight the effectiveness of our approach in improving the quality of entailment tree generation. By leveraging logical entailment patterns, our model produces more coherent and reasonable conclusions that closely align with the underlying premises. Code and Data are released at this https URL
- [1048] arXiv:2403.06420 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: RLingua: Improving Reinforcement Learning Sample Efficiency in Robotic Manipulations With Large Language ModelsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Reinforcement learning (RL) has demonstrated its capability in solving various tasks but is notorious for its low sample efficiency. In this paper, we propose RLingua, a framework that can leverage the internal knowledge of large language models (LLMs) to reduce the sample complexity of RL in robotic manipulations. To this end, we first present a method for extracting the prior knowledge of LLMs by prompt engineering so that a preliminary rule-based robot controller for a specific task can be generated in a user-friendly manner. Despite being imperfect, the LLM-generated robot controller is utilized to produce action samples during rollouts with a decaying probability, thereby improving RL's sample efficiency. We employ TD3, the widely-used RL baseline method, and modify the actor loss to regularize the policy learning towards the LLM-generated controller. RLingua also provides a novel method of improving the imperfect LLM-generated robot controllers by RL. We demonstrate that RLingua can significantly reduce the sample complexity of TD3 in four robot tasks of panda_gym and achieve high success rates in 12 sampled sparsely rewarded robot tasks in RLBench, where the standard TD3 fails. Additionally, We validated RLingua's effectiveness in real-world robot experiments through Sim2Real, demonstrating that the learned policies are effectively transferable to real robot tasks. Further details about our work are available at our project website this https URL .
- [1049] arXiv:2403.06425 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Differential Geometric View and Explainability of GNN on Evolving GraphsComments: Accepted into ICLR 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graphs are ubiquitous in social networks and biochemistry, where Graph Neural Networks (GNN) are the state-of-the-art models for prediction. Graphs can be evolving and it is vital to formally model and understand how a trained GNN responds to graph evolution. We propose a smooth parameterization of the GNN predicted distributions using axiomatic attribution, where the distributions are on a low-dimensional manifold within a high-dimensional embedding space. We exploit the differential geometric viewpoint to model distributional evolution as smooth curves on the manifold. We reparameterize families of curves on the manifold and design a convex optimization problem to find a unique curve that concisely approximates the distributional evolution for human interpretation. Extensive experiments on node classification, link prediction, and graph classification tasks with evolving graphs demonstrate the better sparsity, faithfulness, and intuitiveness of the proposed method over the state-of-the-art methods.
- [1050] arXiv:2403.06433 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Fine-Grained Pillar Feature Encoding Via Spatio-Temporal Virtual Grid for 3D Object DetectionComments: ICRA 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Developing high-performance, real-time architectures for LiDAR-based 3D object detectors is essential for the successful commercialization of autonomous vehicles. Pillar-based methods stand out as a practical choice for onboard deployment due to their computational efficiency. However, despite their efficiency, these methods can sometimes underperform compared to alternative point encoding techniques such as Voxel-encoding or PointNet++. We argue that current pillar-based methods have not sufficiently captured the fine-grained distributions of LiDAR points within each pillar structure. Consequently, there exists considerable room for improvement in pillar feature encoding. In this paper, we introduce a novel pillar encoding architecture referred to as Fine-Grained Pillar Feature Encoding (FG-PFE). FG-PFE utilizes Spatio-Temporal Virtual (STV) grids to capture the distribution of point clouds within each pillar across vertical, temporal, and horizontal dimensions. Through STV grids, points within each pillar are individually encoded using Vertical PFE (V-PFE), Temporal PFE (T-PFE), and Horizontal PFE (H-PFE). These encoded features are then aggregated through an Attentive Pillar Aggregation method. Our experiments conducted on the nuScenes dataset demonstrate that FG-PFE achieves significant performance improvements over baseline models such as PointPillar, CenterPoint-Pillar, and PillarNet, with only a minor increase in computational overhead.
- [1051] arXiv:2403.06447 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail RecommendationComments: 11 pagesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: The long-tail recommendation is a challenging task for traditional recommender systems, due to data sparsity and data imbalance issues. The recent development of large language models (LLMs) has shown their abilities in complex reasoning, which can help to deduce users' preferences based on very few previous interactions. However, since most LLM-based systems rely on items' semantic meaning as the sole evidence for reasoning, the collaborative information of user-item interactions is neglected, which can cause the LLM's reasoning to be misaligned with task-specific collaborative information of the dataset. To further align LLMs' reasoning to task-specific user-item interaction knowledge, we introduce collaborative retrieval-augmented LLMs, CoRAL, which directly incorporate collaborative evidence into the prompts. Based on the retrieved user-item interactions, the LLM can analyze shared and distinct preferences among users, and summarize the patterns indicating which types of users would be attracted by certain items. The retrieved collaborative evidence prompts the LLM to align its reasoning with the user-item interaction patterns in the dataset. However, since the capacity of the input prompt is limited, finding the minimally-sufficient collaborative information for recommendation tasks can be challenging. We propose to find the optimal interaction set through a sequential decision-making process and develop a retrieval policy learned through a reinforcement learning (RL) framework, CoRAL. Our experimental results show that CoRAL can significantly improve LLMs' reasoning abilities on specific recommendation tasks. Our analysis also reveals that CoRAL can more efficiently explore collaborative information through reinforcement learning.
- [1052] arXiv:2403.06448 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Hallucinations in large language models (LLMs) refer to the phenomenon of LLMs producing responses that are coherent yet factually inaccurate. This issue undermines the effectiveness of LLMs in practical applications, necessitating research into detecting and mitigating hallucinations of LLMs. Previous studies have mainly concentrated on post-processing techniques for hallucination detection, which tend to be computationally intensive and limited in effectiveness due to their separation from the LLM's inference process. To overcome these limitations, we introduce MIND, an unsupervised training framework that leverages the internal states of LLMs for real-time hallucination detection without requiring manual annotations. Additionally, we present HELM, a new benchmark for evaluating hallucination detection across multiple LLMs, featuring diverse LLM outputs and the internal states of LLMs during their inference process. Our experiments demonstrate that MIND outperforms existing state-of-the-art methods in hallucination detection.
- [1053] arXiv:2403.06465 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: RecAI: Leveraging Large Language Models for Next-Generation Recommender SystemsComments: 4 pages. Webconf 2024 demo trackSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces RecAI, a practical toolkit designed to augment or even revolutionize recommender systems with the advanced capabilities of Large Language Models (LLMs). RecAI provides a suite of tools, including Recommender AI Agent, Recommendation-oriented Language Models, Knowledge Plugin, RecExplainer, and Evaluator, to facilitate the integration of LLMs into recommender systems from multifaceted perspectives. The new generation of recommender systems, empowered by LLMs, are expected to be more versatile, explainable, conversational, and controllable, paving the way for more intelligent and user-centric recommendation experiences. We hope the open-source of RecAI can help accelerate evolution of new advanced recommender systems. The source code of RecAI is available at \url{ this https URL }.
- [1054] arXiv:2403.06466 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: RL-MSA: a Reinforcement Learning-based Multi-line bus Scheduling ApproachSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Multiple Line Bus Scheduling Problem (MLBSP) is vital to save operational cost of bus company and guarantee service quality for passengers. Existing approaches typically generate a bus scheduling scheme in an offline manner and then schedule buses according to the scheme. In practice, uncertain events such as traffic congestion occur frequently, which may make the pre-determined bus scheduling scheme infeasible. In this paper, MLBSP is modeled as a Markov Decision Process (MDP). A Reinforcement Learning-based Multi-line bus Scheduling Approach (RL-MSA) is proposed for bus scheduling at both the offline and online phases. At the offline phase, deadhead decision is integrated into bus selection decision for the first time to simplify the learning problem. At the online phase, deadhead decision is made through a time window mechanism based on the policy learned at the offline phase. We develop several new and useful state features including the features for control points, bus lines and buses. A bus priority screening mechanism is invented to construct bus-related features. Considering the interests of both the bus company and passengers, a reward function combining the final reward and the step-wise reward is devised. Experiments at the offline phase demonstrate that the number of buses used of RL-MSA is decreased compared with offline optimization approaches. At the online phase, RL-MSA can cover all departure times in a timetable (i.e., service quality) without increasing the number of buses used (i.e., operational cost).
- [1055] arXiv:2403.06479 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Ada-Tracker: Soft Tissue Tracking via Inter-Frame and Adaptive-Template MatchingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Soft tissue tracking is crucial for computer-assisted interventions. Existing approaches mainly rely on extracting discriminative features from the template and videos to recover corresponding matches. However, it is difficult to adopt these techniques in surgical scenes, where tissues are changing in shape and appearance throughout the surgery. To address this problem, we exploit optical flow to naturally capture the pixel-wise tissue deformations and adaptively correct the tracked template. Specifically, we first implement an inter-frame matching mechanism to extract a coarse region of interest based on optical flow from consecutive frames. To accommodate appearance change and alleviate drift, we then propose an adaptive-template matching method, which updates the tracked template based on the reliability of the estimates. Our approach, Ada-Tracker, enjoys both short-term dynamics modeling by capturing local deformations and long-term dynamics modeling by introducing global temporal compensation. We evaluate our approach on the public SurgT benchmark, which is generated from Hamlyn, SCARED, and Kidney boundary datasets. The experimental results show that Ada-Tracker achieves superior accuracy and performs more robustly against prior works. Code is available at this https URL .
- [1056] arXiv:2403.06514 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Structure Your Data: Towards Semantic Graph CounterfactualsJournal-ref: ICML 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Counterfactual explanations (CEs) based on concepts are explanations that consider alternative scenarios to understand which high-level semantic features contributed to particular model predictions. In this work, we propose CEs based on the semantic graphs accompanying input data to achieve more descriptive, accurate, and human-aligned explanations. Building upon state-of-the-art (SoTA) conceptual attempts, we adopt a model-agnostic edit-based approach and introduce leveraging GNNs for efficient Graph Edit Distance (GED) computation. With a focus on the visual domain, we represent images as scene graphs and obtain their GNN embeddings to bypass solving the NP-hard graph similarity problem for all input pairs, an integral part of the CE computation process. We apply our method to benchmark and real-world datasets with varying difficulty and availability of semantic annotations. Testing on diverse classifiers, we find that our CEs outperform previous SoTA explanation models based on semantics, including both white and black-box as well as conceptual and pixel-level approaches. Their superiority is proven quantitatively and qualitatively, as validated by human subjects, highlighting the significance of leveraging semantic edges in the presence of intricate relationships. Our model-agnostic graph-based approach is widely applicable and easily extensible, producing actionable explanations across different contexts.
- [1057] arXiv:2403.06517 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Active Generation for Image ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recently, the growing capabilities of deep generative models have underscored their potential in enhancing image classification accuracy. However, existing methods often demand the generation of a disproportionately large number of images compared to the original dataset, while having only marginal improvements in accuracy. This computationally expensive and time-consuming process hampers the practicality of such approaches. In this paper, we propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model. With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation. It aims to create images akin to the challenging or misclassified samples encountered by the current model and incorporates these generated images into the training set to augment model performance. ActGen introduces an attentive image guidance technique, using real images as guides during the denoising process of a diffusion model. The model's attention on class prompt is leveraged to ensure the preservation of similar foreground object while diversifying the background. Furthermore, we introduce a gradient-based generation guidance method, which employs two losses to generate more challenging samples and prevent the generated images from being too similar to previously generated ones. Experimental results on the CIFAR and ImageNet datasets demonstrate that our method achieves better performance with a significantly reduced number of generated images.
- [1058] arXiv:2403.06520 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: How to Understand Named Entities: Using Common Sense for News CaptioningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: News captioning aims to describe an image with its news article body as input. It greatly relies on a set of detected named entities, including real-world people, organizations, and places. This paper exploits commonsense knowledge to understand named entities for news captioning. By ``understand'', we mean correlating the news content with common sense in the wild, which helps an agent to 1) distinguish semantically similar named entities and 2) describe named entities using words outside of training corpora. Our approach consists of three modules: (a) Filter Module aims to clarify the common sense concerning a named entity from two aspects: what does it mean? and what is it related to?, which divide the common sense into explanatory knowledge and relevant knowledge, respectively. (b) Distinguish Module aggregates explanatory knowledge from node-degree, dependency, and distinguish three aspects to distinguish semantically similar named entities. (c) Enrich Module attaches relevant knowledge to named entities to enrich the entity description by commonsense information (e.g., identity and social position). Finally, the probability distributions from both modules are integrated to generate the news captions. Extensive experiments on two challenging datasets (i.e., GoodNews and NYTimes) demonstrate the superiority of our method. Ablation studies and visualization further validate its effectiveness in understanding named entities.
- [1059] arXiv:2403.06524 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based RewardSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: We develop a deep reinforcement learning framework for tactical decision making in an autonomous truck, specifically for Adaptive Cruise Control (ACC) and lane change maneuvers in a highway scenario. Our results demonstrate that it is beneficial to separate high-level decision-making processes and low-level control actions between the reinforcement learning agent and the low-level controllers based on physical models. In the following, we study optimizing the performance with a realistic and multi-objective reward function based on Total Cost of Operation (TCOP) of the truck using different approaches; by adding weights to reward components, by normalizing the reward components and by using curriculum learning techniques.
- [1060] arXiv:2403.06534 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object DetectionComments: 22 Pages, 10 Figures, 9 TablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Abstract: Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between the pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code is available at this https URL .
- [1061] arXiv:2403.06535 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Decentralized and Lifelong-Adaptive Multi-Agent Collaborative LearningComments: 23 pages, 15 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Decentralized and lifelong-adaptive multi-agent collaborative learning aims to enhance collaboration among multiple agents without a central server, with each agent solving varied tasks over time. To achieve efficient collaboration, agents should: i) autonomously identify beneficial collaborative relationships in a decentralized manner; and ii) adapt to dynamically changing task observations. In this paper, we propose DeLAMA, a decentralized multi-agent lifelong collaborative learning algorithm with dynamic collaboration graphs. To promote autonomous collaboration relationship learning, we propose a decentralized graph structure learning algorithm, eliminating the need for external priors. To facilitate adaptation to dynamic tasks, we design a memory unit to capture the agents' accumulated learning history and knowledge, while preserving finite storage consumption. To further augment the system's expressive capabilities and computational efficiency, we apply algorithm unrolling, leveraging the advantages of both mathematical optimization and neural networks. This allows the agents to `learn to collaborate' through the supervision of training tasks. Our theoretical analysis verifies that inter-agent collaboration is communication efficient under a small number of communication rounds. The experimental results verify its ability to facilitate the discovery of collaboration strategies and adaptation to dynamic learning scenarios, achieving a 98.80% reduction in MSE and a 188.87% improvement in classification accuracy. We expect our work can serve as a foundational technique to facilitate future works towards an intelligent, decentralized, and dynamic multi-agent system. Code is available at this https URL .
- [1062] arXiv:2403.06545 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: ReStainGAN: Leveraging IHC to IF Stain Domain Translation for in-silico Data GenerationDominik Winter , Nicolas Triltsch , Philipp Plewa , Marco Rosati , Thomas Padel , Ross Hill , Markus Schick , Nicolas BrieuComments: 4 pages, 1 figureSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: The creation of in-silico datasets can expand the utility of existing annotations to new domains with different staining patterns in computational pathology. As such, it has the potential to significantly lower the cost associated with building large and pixel precise datasets needed to train supervised deep learning models. We propose a novel approach for the generation of in-silico immunohistochemistry (IHC) images by disentangling morphology specific IHC stains into separate image channels in immunofluorescence (IF) images. The proposed approach qualitatively and quantitatively outperforms baseline methods as proven by training nucleus segmentation models on the created in-silico datasets.
- [1063] arXiv:2403.06586 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ContextGPT: Infusing LLMs Knowledge into Neuro-Symbolic Activity Recognition ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Context-aware Human Activity Recognition (HAR) is a hot research area in mobile computing, and the most effective solutions in the literature are based on supervised deep learning models. However, the actual deployment of these systems is limited by the scarcity of labeled data that is required for training. Neuro-Symbolic AI (NeSy) provides an interesting research direction to mitigate this issue, by infusing common-sense knowledge about human activities and the contexts in which they can be performed into HAR deep learning classifiers. Existing NeSy methods for context-aware HAR rely on knowledge encoded in logic-based models (e.g., ontologies) whose design, implementation, and maintenance to capture new activities and contexts require significant human engineering efforts, technical knowledge, and domain expertise. Recent works show that pre-trained Large Language Models (LLMs) effectively encode common-sense knowledge about human activities. In this work, we propose ContextGPT: a novel prompt engineering approach to retrieve from LLMs common-sense knowledge about the relationship between human activities and the context in which they are performed. Unlike ontologies, ContextGPT requires limited human effort and expertise. An extensive evaluation carried out on two public datasets shows how a NeSy model obtained by infusing common-sense knowledge from ContextGPT is effective in data scarcity scenarios, leading to similar (and sometimes better) recognition rates than logic-based approaches with a fraction of the effort.
- [1064] arXiv:2403.06592 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploiting Style Latent Flows for Generalizing Deepfake Detection Video DetectionComments: Preprint version, final version will be available at this https URL The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) (2024) Published by: IEEE & CVFSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a new approach for the detection of fake videos, based on the analysis of style latent vectors and their abnormal behavior in temporal changes in the generated videos. We discovered that the generated facial videos suffer from the temporal distinctiveness in the temporal changes of style latent vectors, which are inevitable during the generation of temporally stable videos with various facial expressions and geometric transformations. Our framework utilizes the StyleGRU module, trained by contrastive learning, to represent the dynamic properties of style latent vectors. Additionally, we introduce a style attention module that integrates StyleGRU-generated features with content-based features, enabling the detection of visual and temporal artifacts. We demonstrate our approach across various benchmark scenarios in deepfake detection, showing its superiority in cross-dataset and cross-manipulation scenarios. Through further analysis, we also validate the importance of using temporal changes of style latent vectors to improve the generality of deepfake video detection.
- [1065] arXiv:2403.06601 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Cross-domain and Cross-dimension Learning for Image-to-Graph TransformersAlexander H. Berger , Laurin Lux , Suprosanna Shit , Ivan Ezhov , Georgios Kaissis , Martin J. Menten , Daniel Rueckert , Johannes C. PaetzoldSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Direct image-to-graph transformation is a challenging task that solves object detection and relationship prediction in a single model. Due to the complexity of this task, large training datasets are rare in many domains, which makes the training of large networks challenging. This data sparsity necessitates the establishment of pre-training strategies akin to the state-of-the-art in computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss for sampling the optimal number of object relationships (edges) across domains, (2) a domain adaptation framework for image-to-graph transformers that aligns features from different domains, and (3) a simple projection function that allows us to pretrain 3D transformers on 2D input data. We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D. Our method consistently outperforms a series of baselines on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.
- [1066] arXiv:2403.06609 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Guiding Clinical Reasoning with Large Language Models via Knowledge SeedsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Clinical reasoning refers to the cognitive process that physicians employ in evaluating and managing patients. This process typically involves suggesting necessary examinations, diagnosing patients' diseases, and deciding on appropriate therapies, etc. Accurate clinical reasoning requires extensive medical knowledge and rich clinical experience, setting a high bar for physicians. This is particularly challenging in developing countries due to the overwhelming number of patients and limited physician resources, contributing significantly to global health inequity and necessitating automated clinical reasoning approaches. Recently, the emergence of large language models (LLMs) such as ChatGPT and GPT-4 have demonstrated their potential in clinical reasoning. However, these LLMs are prone to hallucination problems, and the reasoning process of LLMs may not align with the clinical decision path of physicians. In this study, we introduce a novel framework, In-Context Padding (ICP), designed to enhance LLMs with medical knowledge. Specifically, we infer critical clinical reasoning elements (referred to as knowledge seeds) and use these as anchors to guide the generation process of LLMs. Experiments on two clinical question datasets demonstrate that ICP significantly improves the clinical reasoning ability of LLMs.
- [1067] arXiv:2403.06611 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway EncodingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: With appropriate data selection and training techniques, Large Language Models (LLMs) have demonstrated exceptional success in various medical examinations and multiple-choice questions. However, the application of LLMs in medical dialogue generation-a task more closely aligned with actual medical practice-has been less explored. This gap is attributed to the insufficient medical knowledge of LLMs, which leads to inaccuracies and hallucinated information in the generated medical responses. In this work, we introduce the Medical dialogue with Knowledge enhancement and clinical Pathway encoding (MedKP) framework, which integrates an external knowledge enhancement module through a medical knowledge graph and an internal clinical pathway encoding via medical entities and physician actions. Evaluated with comprehensive metrics, our experiments on two large-scale, real-world online medical consultation datasets (MedDG and KaMed) demonstrate that MedKP surpasses multiple baselines and mitigates the incidence of hallucinations, achieving a new state-of-the-art. Extensive ablation studies further reveal the effectiveness of each component of MedKP. This enhancement advances the development of reliable, automated medical consultation responses using LLMs, thereby broadening the potential accessibility of precise and real-time medical assistance.
- [1068] arXiv:2403.06621 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Forest Inspection Dataset for Aerial Semantic Segmentation and Depth EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Humans use UAVs to monitor changes in forest environments since they are lightweight and provide a large variety of surveillance data. However, their information does not present enough details for understanding the scene which is needed to assess the degree of deforestation. Deep learning algorithms must be trained on large amounts of data to output accurate interpretations, but ground truth recordings of annotated forest imagery are not available. To solve this problem, we introduce a new large aerial dataset for forest inspection which contains both real-world and virtual recordings of natural environments, with densely annotated semantic segmentation labels and depth maps, taken in different illumination conditions, at various altitudes and recording angles. We test the performance of two multi-scale neural networks for solving the semantic segmentation task (HRNet and PointFlow network), studying the impact of the various acquisition conditions and the capabilities of transfer learning from virtual to real data. Our results showcase that the best results are obtained when the training is done on a dataset containing a large variety of scenarios, rather than separating the data into specific categories. We also develop a framework to assess the deforestation degree of an area.
- [1069] arXiv:2403.06631 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Evaluating the Energy Efficiency of Few-Shot Learning for Object Detection in Industrial SettingsGeorgios Tsoumplekas , Vladislav Li , Ilias Siniosoglou , Vasileios Argyriou , Sotirios K. Goudos , Ioannis D. Moscholios , Panagiotis Radoglou-Grammatikis , Panagiotis SarigiannidisComments: 7 pages, 6 figures, 4 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In the ever-evolving era of Artificial Intelligence (AI), model performance has constituted a key metric driving innovation, leading to an exponential growth in model size and complexity. However, sustainability and energy efficiency have been critical requirements during deployment in contemporary industrial settings, necessitating the use of data-efficient approaches such as few-shot learning. In this paper, to alleviate the burden of lengthy model training and minimize energy consumption, a finetuning approach to adapt standard object detection models to downstream tasks is examined. Subsequently, a thorough case study and evaluation of the energy demands of the developed models, applied in object detection benchmark datasets from volatile industrial environments is presented. Specifically, different finetuning strategies as well as utilization of ancillary evaluation data during training are examined, and the trade-off between performance and efficiency is highlighted in this low-data regime. Finally, this paper introduces a novel way to quantify this trade-off through a customized Efficiency Factor metric.
- [1070] arXiv:2403.06642 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: KELLMRec: Knowledge-Enhanced Large Language Models for RecommendationComments: 9 pages, 1 figureSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The utilization of semantic information is an important research problem in the field of recommender systems, which aims to complement the missing parts of mainstream ID-based approaches. With the rise of LLM, its ability to act as a knowledge base and its reasoning capability have opened up new possibilities for this research area, making LLM-based recommendation an emerging research direction. However, directly using LLM to process semantic information for recommendation scenarios is unreliable and sub-optimal due to several problems such as hallucination. A promising way to cope with this is to use external knowledge to aid LLM in generating truthful and usable text. Inspired by the above motivation, we propose a Knowledge-Enhanced LLMRec method. In addition to using external knowledge in prompts, the proposed method also includes a knowledge-based contrastive learning scheme for training. Experiments on public datasets and in-enterprise datasets validate the effectiveness of the proposed method.
- [1071] arXiv:2403.06659 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge EnhancementComments: Accepted by ICML2024Subjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) methods show promise in representation learning from unannotated ECG data, they often overlook the clinical knowledge that can be found in reports. This oversight and the requirement for annotated samples for downstream tasks limit eSSL's versatility. In this work, we address these issues with the Multimodal ECG Representation Learning (MERL}) framework. Through multimodal learning on ECG records and associated reports, MERL is capable of performing zero-shot ECG classification with text prompts, eliminating the need for training data in downstream tasks. At test time, we propose the Clinical Knowledge Enhanced Prompt Engineering (CKEPE) approach, which uses Large Language Models (LLMs) to exploit external expert-verified clinical knowledge databases, generating more descriptive prompts and reducing hallucinations in LLM-generated content to boost zero-shot classification. Based on MERL, we perform the first benchmark across six public ECG datasets, showing the superior performance of MERL compared against eSSL methods. Notably, MERL achieves an average AUC score of 75.2% in zero-shot classification (without training data), 3.2% higher than linear probed eSSL methods with 10\% annotated training data, averaged across all six datasets.
- [1072] arXiv:2403.06660 (cross-list from cs.MM) [ pdf , ps , html , other ]
-
Title: FashionReGen: LLM-Empowered Fashion Report GenerationSubjects: Multimedia (cs.MM) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Fashion analysis refers to the process of examining and evaluating trends, styles, and elements within the fashion industry to understand and interpret its current state, generating fashion reports. It is traditionally performed by fashion professionals based on their expertise and experience, which requires high labour cost and may also produce biased results for relying heavily on a small group of people. In this paper, to tackle the Fashion Report Generation (FashionReGen) task, we propose an intelligent Fashion Analyzing and Reporting system based the advanced Large Language Models (LLMs), debbed as GPT-FAR. Specifically, it tries to deliver FashionReGen based on effective catwalk analysis, which is equipped with several key procedures, namely, catwalk understanding, collective organization and analysis, and report generation. By posing and exploring such an open-ended, complex and domain-specific task of FashionReGen, it is able to test the general capability of LLMs in fashion domain. It also inspires the explorations of more high-level tasks with industrial significance in other domains. Video illustration and more materials of GPT-FAR can be found in this https URL .
- [1073] arXiv:2403.06670 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CEAT: Continual Expansion and Absorption Transformer for Non-Exemplar Class-Incremental LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In real-world applications, dynamic scenarios require the models to possess the capability to learn new tasks continuously without forgetting the old knowledge. Experience-Replay methods store a subset of the old images for joint training. In the scenario of more strict privacy protection, storing the old images becomes infeasible, which leads to a more severe plasticity-stability dilemma and classifier bias. To meet the above challenges, we propose a new architecture, named continual expansion and absorption transformer~(CEAT). The model can learn the novel knowledge by extending the expanded-fusion layers in parallel with the frozen previous parameters. After the task ends, we losslessly absorb the extended parameters into the backbone to ensure that the number of parameters remains constant. To improve the learning ability of the model, we designed a novel prototype contrastive loss to reduce the overlap between old and new classes in the feature space. Besides, to address the classifier bias towards the new classes, we propose a novel approach to generate the pseudo-features to correct the classifier. We experiment with our methods on three standard Non-Exemplar Class-Incremental Learning~(NECIL) benchmarks. Extensive experiments demonstrate that our model gets a significant improvement compared with the previous works and achieves 5.38%, 5.20%, and 4.92% improvement on CIFAR-100, TinyImageNet, and ImageNet-Subset.
- [1074] arXiv:2403.06674 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Car Damage Detection and Patch-to-Patch Self-supervised Image AlignmentComments: The paper has been accepted and given a poster presentation at NeurIPS 2021 WiML Workshop ( this https URL )Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Most computer vision applications aim to identify pixels in a scene and use them for diverse purposes. One intriguing application is car damage detection for insurance carriers which tends to detect all car damages by comparing both pre-trip and post-trip images, even requiring two components: (i) car damage detection; (ii) image alignment. Firstly, we implemented a Mask R-CNN model to detect car damages on custom images. Whereas for the image alignment section, we especially propose a novel self-supervised Patch-to-Patch SimCLR inspired alignment approach to find perspective transformations between custom pre/post car rental images except for traditional computer vision methods.
- [1075] arXiv:2403.06675 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Poisoning Programs by Un-Repairing Code: Security Concerns of AI-generated CodeComments: Accepted at The 1st IEEE International Workshop on Reliable and Secure AI for Software Engineering (ReSAISE), co-located with ISSRE 2023Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: AI-based code generators have gained a fundamental role in assisting developers in writing software starting from natural language (NL). However, since these large language models are trained on massive volumes of data collected from unreliable online sources (e.g., GitHub, Hugging Face), AI models become an easy target for data poisoning attacks, in which an attacker corrupts the training data by injecting a small amount of poison into it, i.e., astutely crafted malicious samples. In this position paper, we address the security of AI code generators by identifying a novel data poisoning attack that results in the generation of vulnerable code. Next, we devise an extensive evaluation of how these attacks impact state-of-the-art models for code generation. Lastly, we discuss potential solutions to overcome this threat.
- [1076] arXiv:2403.06677 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Streamlining in the Riemannian Realm: Efficient Riemannian Optimization with Loopless Variance ReductionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: In this study, we investigate stochastic optimization on Riemannian manifolds, focusing on the crucial variance reduction mechanism used in both Euclidean and Riemannian settings. Riemannian variance-reduced methods usually involve a double-loop structure, computing a full gradient at the start of each loop. Determining the optimal inner loop length is challenging in practice, as it depends on strong convexity or smoothness constants, which are often unknown or hard to estimate. Motivated by Euclidean methods, we introduce the Riemannian Loopless SVRG (R-LSVRG) and PAGE (R-PAGE) methods. These methods replace the outer loop with probabilistic gradient computation triggered by a coin flip in each iteration, ensuring simpler proofs, efficient hyperparameter selection, and sharp convergence guarantees. Using R-PAGE as a framework for non-convex Riemannian optimization, we demonstrate its applicability to various important settings. For example, we derive Riemannian MARINA (R-MARINA) for distributed settings with communication compression, providing the best theoretical communication complexity guarantees for non-convex distributed optimization over Riemannian manifolds. Experimental results support our theoretical findings.
- [1077] arXiv:2403.06725 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Improving Low-Resource Knowledge Tracing Tasks by Supervised Pre-training and Importance Mechanism Fine-tuningComments: 29 pages, 4 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Knowledge tracing (KT) aims to estimate student's knowledge mastery based on their historical interactions. Recently, the deep learning based KT (DLKT) approaches have achieved impressive performance in the KT task. These DLKT models heavily rely on the large number of available student interactions. However, due to various reasons such as budget constraints and privacy concerns, observed interactions are very limited in many real-world scenarios, a.k.a, low-resource KT datasets. Directly training a DLKT model on a low-resource KT dataset may lead to overfitting and it is difficult to choose the appropriate deep neural architecture. Therefore, in this paper, we propose a low-resource KT framework called LoReKT to address above challenges. Inspired by the prevalent "pre-training and fine-tuning" paradigm, we aim to learn transferable parameters and representations from rich-resource KT datasets during the pre-training stage and subsequently facilitate effective adaptation to low-resource KT datasets. Specifically, we simplify existing sophisticated DLKT model architectures with purely a stack of transformer decoders. We design an encoding mechanism to incorporate student interactions from multiple KT data sources and develop an importance mechanism to prioritize updating parameters with high importance while constraining less important ones during the fine-tuning stage. We evaluate LoReKT on six public KT datasets and experimental results demonstrate the superiority of our approach in terms of AUC and Accuracy. To encourage reproducible research, we make our data and code publicly available at https://anonymous.4open.science/r/LoReKT-C619.
- [1078] arXiv:2403.06735 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Enhancing Image Caption Generation Using Reinforcement Learning with Human FeedbackComments: 6 Pages, 8 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Research on generative models to produce human-aligned / human-preferred outputs has seen significant recent contributions. Between text and image-generative models, we narrowed our focus to text-based generative models, particularly to produce captions for images that align with human preferences. In this research, we explored a potential method to amplify the performance of the Deep Neural Network Model to generate captions that are preferred by humans. This was achieved by integrating Supervised Learning and Reinforcement Learning with Human Feedback (RLHF) using the Flickr8k dataset. Also, a novel loss function that is capable of optimizing the model based on human feedback is introduced. In this paper, we provide a concise sketch of our approach and results, hoping to contribute to the ongoing advances in the field of human-aligned generative AI models.
- [1079] arXiv:2403.06745 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine TranslationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language model (LLM) has achieved promising performance in multilingual machine translation tasks through zero/few-shot prompts or prompt-tuning. However, due to the mixture of multilingual data during the pre-training of LLM, the LLM-based translation models face the off-target issue in both prompt-based methods, including a series of phenomena, namely instruction misunderstanding, translation with wrong language and over-generation. For this issue, this paper introduces an \textbf{\underline{A}}uto-\textbf{\underline{C}}onstriction \textbf{\underline{T}}urning mechanism for \textbf{\underline{M}}ultilingual \textbf{\underline{N}}eural \textbf{\underline{M}}achine \textbf{\underline{T}}ranslation (\model), which is a novel supervised fine-tuning mechanism and orthogonal to the traditional prompt-based methods. In this method, \model automatically constructs a constrained template in the target side by adding trigger tokens ahead of the ground truth. Furthermore, trigger tokens can be arranged and combined freely to represent different task semantics, and they can be iteratively updated to maximize the label likelihood. Experiments are performed on WMT test sets with multiple metrics, and the experimental results demonstrate that \model achieves substantially improved performance across multiple translation directions and reduce the off-target phenomena in the translation.
- [1080] arXiv:2403.06754 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ALaRM: Align Language Models via Hierarchical Rewards ModelingComments: 15 pages, 6 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework addresses the limitations of current alignment approaches, which often struggle with the inconsistency and sparsity of human supervision signals, by integrating holistic rewards with aspect-specific rewards. This integration enables more precise and consistent guidance of language models towards desired outcomes, particularly in complex and open text generation tasks. By employing a methodology that filters and combines multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment. We validate our approach through applications in long-form question answering and machine translation tasks, employing gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training processes for better human preference alignment. We release our code at this https URL .
- [1081] arXiv:2403.06764 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language ModelsComments: 21 papes, 8 figures, code is released at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45 reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at this https URL .
- [1082] arXiv:2403.06786 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Genetic Learning for Designing Sim-to-Real Data AugmentationsComments: 21 pages; accepted at DMLR Workshop @ ICRL 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Data augmentations are useful in closing the sim-to-real domain gap when training on synthetic data. This is because they widen the training data distribution, thus encouraging the model to generalize better to other domains. Many image augmentation techniques exist, parametrized by different settings, such as strength and probability. This leads to a large space of different possible augmentation policies. Some policies work better than others for overcoming the sim-to-real gap for specific datasets, and it is unclear why. This paper presents two different interpretable metrics that can be combined to predict how well a certain augmentation policy will work for a specific sim-to-real setting, focusing on object detection. We validate our metrics by training many models with different augmentation policies and showing a strong correlation with performance on real data. Additionally, we introduce GeneticAugment, a genetic programming method that can leverage these metrics to automatically design an augmentation policy for a specific dataset without needing to train a model.
- [1083] arXiv:2403.06817 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: Are Targeted Messages More Effective?Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Graph neural networks (GNN) are deep learning architectures for graphs. Essentially, a GNN is a distributed message passing algorithm, which is controlled by parameters learned from data. It operates on the vertices of a graph: in each iteration, vertices receive a message on each incoming edge, aggregate these messages, and then update their state based on their current state and the aggregated messages. The expressivity of GNNs can be characterised in terms of certain fragments of first-order logic with counting and the Weisfeiler-Lehman algorithm.
The core GNN architecture comes in two different versions. In the first version, a message only depends on the state of the source vertex, whereas in the second version it depends on the states of the source and target vertices. In practice, both of these versions are used, but the theory of GNNs so far mostly focused on the first one. On the logical side, the two versions correspond to two fragments of first-order logic with counting that we call modal and guarded.
The question whether the two versions differ in their expressivity has been mostly overlooked in the GNN literature and has only been asked recently (Grohe, LICS'23). We answer this question here. It turns out that the answer is not as straightforward as one might expect. By proving that the modal and guarded fragment of first-order logic with counting have the same expressivity over labelled undirected graphs, we show that in a non-uniform setting the two GNN versions have the same expressivity. However, we also prove that in a uniform setting the second version is strictly more expressive. - [1084] arXiv:2403.06826 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: In-context Exploration-Exploitation for Reinforcement LearningComments: Published at ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: In-context learning is a promising approach for online policy learning of offline reinforcement learning (RL) methods, which can be achieved at inference time without gradient optimization. However, this method is hindered by significant computational costs resulting from the gathering of large training trajectory sets and the need to train large Transformer models. We address this challenge by introducing an In-context Exploration-Exploitation (ICEE) algorithm, designed to optimize the efficiency of in-context policy learning. Unlike existing models, ICEE performs an exploration-exploitation trade-off at inference time within a Transformer model, without the need for explicit Bayesian inference. Consequently, ICEE can solve Bayesian optimization problems as efficiently as Gaussian process biased methods do, but in significantly less time. Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method.
- [1085] arXiv:2403.06828 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: NeuPAN: Direct Point Robot Navigation with End-to-End Model-based LearningRuihua Han , Shuai Wang , Shuaijun Wang , Zeqing Zhang , Jianjun Chen , Shijie Lin , Chengyang Li , Chengzhong Xu , Yonina C. Eldar , Qi Hao , Jia PanComments: submit to TROSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Navigating a nonholonomic robot in a cluttered environment requires extremely accurate perception and locomotion for collision avoidance. This paper presents NeuPAN: a real-time, highly-accurate, map-free, robot-agnostic, and environment-invariant robot navigation solution. Leveraging a tightly-coupled perception-locomotion framework, NeuPAN has two key innovations compared to existing approaches: 1) it directly maps raw points to a learned multi-frame distance space, avoiding error propagation from perception to control; 2) it is interpretable from an end-to-end model-based learning perspective, enabling provable convergence. The crux of NeuPAN is to solve a high-dimensional end-to-end mathematical model with various point-level constraints using the plug-and-play (PnP) proximal alternating-minimization network (PAN) with neurons in the loop. This allows NeuPAN to generate real-time, end-to-end, physically-interpretable motions directly from point clouds, which seamlessly integrates data- and knowledge-engines, where its network parameters are adjusted via back propagation. We evaluate NeuPAN on car-like robot, wheel-legged robot, and passenger autonomous vehicle, in both simulated and real-world environments. Experiments demonstrate that NeuPAN outperforms various benchmarks, in terms of accuracy, efficiency, robustness, and generalization capability across various environments, including the cluttered sandbox, office, corridor, and parking lot. We show that NeuPAN works well in unstructured environments with arbitrary-shape undetectable objects, making impassable ways passable.
- [1086] arXiv:2403.06832 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation FrameworkComments: Ongoing work; 10 pages, 6 Tables, 2 Figures; Repo is available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The advancement of Multi-modal Pre-training highlights the necessity for a robust Multi-Modal Knowledge Graph (MMKG) representation learning framework. This framework is crucial for integrating structured knowledge into multi-modal Large Language Models (LLMs) at scale, aiming to alleviate issues like knowledge misconceptions and multi-modal hallucinations. In this work, to evaluate models' ability to accurately embed entities within MMKGs, we focus on two widely researched tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking for the robust integration of multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets (three for MKGC and seven for MEMA), demonstrating its robustness and versatility. Besides, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Our code and data are available at: this https URL .
- [1087] arXiv:2403.06835 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology PromptingComments: 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Data scarcity and privacy concerns limit the availability of high-quality medical images for public use, which can be mitigated through medical image synthesis. However, current medical image synthesis methods often struggle to accurately capture the complexity of detailed anatomical structures and pathological conditions. To address these challenges, we propose a novel medical image synthesis model that leverages fine-grained image-text alignment and anatomy-pathology prompts to generate highly detailed and accurate synthetic medical images. Our method integrates advanced natural language processing techniques with image generative modeling, enabling precise alignment between descriptive text prompts and the synthesized images' anatomical and pathological details. The proposed approach consists of two key components: an anatomy-pathology prompting module and a fine-grained alignment-based synthesis module. The anatomy-pathology prompting module automatically generates descriptive prompts for high-quality medical images. To further synthesize high-quality medical images from the generated prompts, the fine-grained alignment-based synthesis module pre-defines a visual codebook for the radiology dataset and performs fine-grained alignment between the codebook and generated prompts to obtain key patches as visual clues, facilitating accurate image synthesis. We validate the superiority of our method through experiments on public chest X-ray datasets and demonstrate that our synthetic images preserve accurate semantic information, making them valuable for various medical applications.
- [1088] arXiv:2403.06840 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-FeedbackComments: 15 pages, 4 figures. Providing first version RA-ISFSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn't previously by retrieving knowledge relevant to the query. This approach improves performance in certain scenarios for specific tasks. However, if irrelevant texts are retrieved, it may impair model performance. In this paper, we propose Retrieval Augmented Iterative Self-Feedback (RA-ISF), a framework that iteratively decomposes tasks and processes them in three submodules to enhance the model's problem-solving capabilities. Experiments show that our method outperforms existing benchmarks, performing well on models like GPT3.5, Llama2, significantly enhancing factual reasoning capabilities and reducing hallucinations.
- [1089] arXiv:2403.06869 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning with Noisy Foundation ModelsHao Chen , Jindong Wang , Zihan Wang , Ran Tao , Hongxin Wei , Xing Xie , Masashi Sugiyama , Bhiksha RajComments: 18 pages, 10 figures, 6 tables, preprint. arXiv admin note: substantial text overlap with arXiv:2309.17002Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization, which is applicable in both parameter-efficient and black-box tuning manners. We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term as Noisy Model Learning.
- [1090] arXiv:2403.06872 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Exploring Large Language Models and Hierarchical Frameworks for Classification of Large Unstructured Legal DocumentsComments: This paper was accepted as a long paper at ECIR 2024. arXiv admin note: substantial text overlap with arXiv:2309.10563Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Legal judgment prediction suffers from the problem of long case documents exceeding tens of thousands of words, in general, and having a non-uniform structure. Predicting judgments from such documents becomes a challenging task, more so on documents with no structural annotation. We explore the classification of these large legal documents and their lack of structural information with a deep-learning-based hierarchical framework which we call MESc; "Multi-stage Encoder-based Supervised with-clustering"; for judgment prediction. Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model, and try to approximate their structure through unsupervised clustering. Which we use in another set of transformer encoder layers to learn the inter-chunk representations. We analyze the adaptability of Large Language Models (LLMs) with multi-billion parameters (GPT-Neo, and GPT-J) with the hierarchical framework of MESc and compare them with their standalone performance on legal texts. We also study their intra-domain(legal) transfer learning capability and the impact of combining embeddings from their last layers in MESc. We test these methods and their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. Our approach achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods.
- [1091] arXiv:2403.06880 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement LearningJunseok Park , Yoonsung Kim , Hee Bin Yoo , Min Whoo Lee , Kibeom Kim , Won-Seok Choi , Minsu Lee , Byoung-Tak ZhangComments: Accepted as a full paper at AAAI 2024 (Oral presentation): 7 pages (main paper), 2 pages (references), 17 pages (appendix) eachSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using Cross-Density Visualizer technique, we observed that transitions, especially the S2D, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models.
- [1092] arXiv:2403.06901 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: LIBR+: Improving Intraoperative Liver Registration by Learning the Residual of Biomechanics-Based Deformable RegistrationComments: 12 pages, Medical Image Computing and Computer Assisted Intervention 2024Subjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The surgical environment imposes unique challenges to the intraoperative registration of organ shapes to their preoperatively-imaged geometry. Biomechanical model-based registration remains popular, while deep learning solutions remain limited due to the sparsity and variability of intraoperative measurements and the limited ground-truth deformation of an organ that can be obtained during the surgery. In this paper, we propose a novel \textit{hybrid} registration approach that leverage a linearized iterative boundary reconstruction (LIBR) method based on linear elastic biomechanics, and use deep neural networks to learn its residual to the ground-truth deformation (LIBR+). We further formulate a dual-branch spline-residual graph convolutional neural network (SR-GCN) to assimilate information from sparse and variable intraoperative measurements and effectively propagate it through the geometry of the 3D organ. Experiments on a large intraoperative liver registration dataset demonstrated the consistent improvements achieved by LIBR+ in comparison to existing rigid, biomechnical model-based non-rigid, and deep-learning based non-rigid approaches to intraoperative liver registration.
- [1093] arXiv:2403.06906 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cost-Sensitive Learning to Defer to Multiple Experts with Workload ConstraintsJean V. Alves , Diogo Leitão , Sérgio Jesus , Marco O. P. Sampaio , Javier Liébana , Pedro Saleiro , Mário A. T. Figueiredo , Pedro BizarroSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Learning to defer (L2D) aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier. Existing research in L2D overlooks key aspects of real-world systems that impede its practical adoption, namely: i) neglecting cost-sensitive scenarios, where type 1 and type 2 errors have different costs; ii) requiring concurrent human predictions for every instance of the training dataset and iii) not dealing with human work capacity constraints. To address these issues, we propose the deferral under cost and capacity constraints framework (DeCCaF). DeCCaF is a novel L2D approach, employing supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and using constraint programming to globally minimize the error cost subject to workload limitations. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, with individual work capacity constraints. The results demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average 8.4% reduction in the misclassification cost.
- [1094] arXiv:2403.06914 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context LearningComments: ICLR 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities, where a LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). Nevertheless, the inclusion of demonstrations leads to a quadratic increase in the computational overhead of the self-attention mechanism. Existing solutions attempt to distill lengthy demonstrations into compact vectors. However, they often require task-specific retraining or compromise LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit the knowledge distillation to enhance alignment between MEND and LLM, achieving both efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge of distilling demonstrations through a two-stage training process, which includes meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions using decoder-only (GPT-2) and encoder-decoder (T5) attest to MEND's prowess. It not only matches but often outperforms the Vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing the computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of large language models
- [1095] arXiv:2403.06925 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Simplicity Bias of Transformers to Learn Low Sensitivity FunctionsComments: 24 pages, 19 figures, 3 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Abstract: Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of the inductive biases that they have and how those biases are different from other neural network architectures remains elusive. Various neural network architectures such as fully connected networks have been found to have a simplicity bias towards simple functions of the data; one version of this simplicity bias is a spectral bias to learn simple functions in the Fourier space. In this work, we identify the notion of sensitivity of the model to random changes in the input as a notion of simplicity bias which provides a unified metric to explain the simplicity and spectral bias of transformers across different data modalities. We show that transformers have lower sensitivity than alternative architectures, such as LSTMs, MLPs and CNNs, across both vision and language tasks. We also show that low-sensitivity bias correlates with improved robustness; furthermore, it can also be used as an efficient intervention to further improve the robustness of transformers.
- [1096] arXiv:2403.06936 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Counterfactual Reasoning with Knowledge Graph EmbeddingsComments: Accepted to EACL 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Knowledge graph embeddings (KGEs) were originally developed to infer true but missing facts in incomplete knowledge repositories. In this paper, we link knowledge graph completion and counterfactual reasoning via our new task CFKGR. We model the original world state as a knowledge graph, hypothetical scenarios as edges added to the graph, and plausible changes to the graph as inferences from logical rules. We create corresponding benchmark datasets, which contain diverse hypothetical scenarios with plausible changes to the original knowledge graph and facts that should be retained. We develop COULDD, a general method for adapting existing knowledge graph embeddings given a hypothetical premise, and evaluate it on our benchmark. Our results indicate that KGEs learn patterns in the graph without explicit training. We further observe that KGEs adapted with COULDD solidly detect plausible counterfactual changes to the graph that follow these patterns. An evaluation on human-annotated data reveals that KGEs adapted with COULDD are mostly unable to recognize changes to the graph that do not follow learned inference rules. In contrast, ChatGPT mostly outperforms KGEs in detecting plausible changes to the graph but has poor knowledge retention. In summary, CFKGR connects two previously distinct areas, namely KG completion and counterfactual reasoning.
- [1097] arXiv:2403.06952 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated DataComments: First two authors contributed equally; Project website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, such as incorrect spatial relationship or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging. First, SELMA leverages an LLM's in-context learning capability to generate multiple datasets of text prompts that can teach different skills, and then generates the images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to the new skills by learning multiple single-skill LoRA (low-rank adaptation) experts followed by expert merging. Our independent expert fine-tuning specializes multiple models for different skills, and expert merging helps build a joint multi-skill T2I model that can generate faithful images given diverse text prompts, while mitigating the knowledge conflict from different datasets. We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation. Moreover, fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data. Lastly, we show that fine-tuning with images from a weaker T2I model can help improve the generation quality of a stronger T2I model, suggesting promising weak-to-strong generalization in T2I models.
- [1098] arXiv:2403.06963 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The pitfalls of next-token predictionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Can a mere next-token predictor faithfully model human intelligence? We crystallize this intuitive concern, which is fragmented in the literature. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under this https URL
- [1099] arXiv:2403.06993 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Automatic driving lane change safety prediction model based on LSTMSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Abstract: Autonomous driving technology can improve traffic safety and reduce traffic accidents. In addition, it improves traffic flow, reduces congestion, saves energy and increases travel efficiency. In the relatively mature automatic driving technology, the automatic driving function is divided into several modules: perception, decision-making, planning and control, and a reasonable division of labor can improve the stability of the system. Therefore, autonomous vehicles need to have the ability to predict the trajectory of surrounding vehicles in order to make reasonable decision planning and safety measures to improve driving safety. By using deep learning method, a safety-sensitive deep learning model based on short term memory (LSTM) network is proposed. This model can alleviate the shortcomings of current automatic driving trajectory planning, and the output trajectory not only ensures high accuracy but also improves safety. The cell state simulation algorithm simulates the trackability of the trajectory generated by this model. The research results show that compared with the traditional model-based method, the trajectory prediction method based on LSTM network has obvious advantages in predicting the trajectory in the long time domain. The intention recognition module considering interactive information has higher prediction and accuracy, and the algorithm results show that the trajectory is very smooth based on the premise of safe prediction and efficient lane change. And autonomous vehicles can efficiently and safely complete lane changes.
- [1100] arXiv:2403.06994 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Physics Sensor Based Deep Learning Fall Detection SystemSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Fall detection based on embedded sensor is a practical and popular research direction in recent years. In terms of a specific application: fall detection methods based upon physics sensors such as [gyroscope and accelerator] have been exploited using traditional hand crafted features and feed them in machine learning models like Markov chain or just threshold based classification methods. In this paper, we build a complete system named TSFallDetect including data receiving device based on embedded sensor, mobile deep-learning model deploying platform, and a simple server, which will be used to gather models and data for future expansion. On the other hand, we exploit the sequential deep-learning methods to address this falling motion prediction problem based on data collected by inertial and film pressure sensors. We make a empirical study based on existing datasets and our datasets collected from our system separately, which shows that the deep-learning model has more potential advantage than other traditional methods, and we proposed a new deep-learning model based on the time series data to predict the fall, and it may be superior to other sequential models in this particular field.
- [1101] arXiv:2403.06999 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Survival modeling using deep learning, machine learning and statistical methods: A comparative analysis for predicting mortality after hospital admissionZiwen Wang , Jin Wee Lee , Tanujit Chakraborty , Yilin Ning , Mingxuan Liu , Feng Xie , Marcus Eng Hock Ong , Nan LiuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Survival analysis is essential for studying time-to-event outcomes and providing a dynamic understanding of the probability of an event occurring over time. Various survival analysis techniques, from traditional statistical models to state-of-the-art machine learning algorithms, support healthcare intervention and policy decisions. However, there remains ongoing discussion about their comparative performance. We conducted a comparative study of several survival analysis methods, including Cox proportional hazards (CoxPH), stepwise CoxPH, elastic net penalized Cox model, Random Survival Forests (RSF), Gradient Boosting machine (GBM) learning, AutoScore-Survival, DeepSurv, time-dependent Cox model based on neural network (CoxTime), and DeepHit survival neural network. We applied the concordance index (C-index) for model goodness-of-fit, and integral Brier scores (IBS) for calibration, and considered the model interpretability. As a case study, we performed a retrospective analysis of patients admitted through the emergency department of a tertiary hospital from 2017 to 2019, predicting 90-day all-cause mortality based on patient demographics, clinicopathological features, and historical data. The results of the C-index indicate that deep learning achieved comparable performance, with DeepSurv producing the best discrimination (DeepSurv: 0.893; CoxTime: 0.892; DeepHit: 0.891). The calibration of DeepSurv (IBS: 0.041) performed the best, followed by RSF (IBS: 0.042) and GBM (IBS: 0.0421), all using the full variables. Moreover, AutoScore-Survival, using a minimal variable subset, is easy to interpret, and can achieve good discrimination and calibration (C-index: 0.867; IBS: 0.044). While all models were satisfactory, DeepSurv exhibited the best discrimination and calibration. In addition, AutoScore-Survival offers a more parsimonious model and excellent interpretability.
- [1102] arXiv:2403.07008 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: AutoEval Done Right: Using Synthetic Data for Model EvaluationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
- [1103] arXiv:2403.07017 (cross-list from physics.soc-ph) [ pdf , ps , html , other ]
-
Title: Mathematics of multi-agent learning systems at the interface of game theory and artificial intelligenceComments: 8 pages, 1 figureSubjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Abstract: Evolutionary Game Theory (EGT) and Artificial Intelligence (AI) are two fields that, at first glance, might seem distinct, but they have notable connections and intersections. The former focuses on the evolution of behaviors (or strategies) in a population, where individuals interact with others and update their strategies based on imitation (or social learning). The more successful a strategy is, the more prevalent it becomes over time. The latter, meanwhile, is centered on machine learning algorithms and (deep) neural networks. It is often from a single-agent perspective but increasingly involves multi-agent environments, in which intelligent agents adjust their strategies based on feedback and experience, somewhat akin to the evolutionary process yet distinct in their self-learning capacities. In light of the key components necessary to address real-world problems, including (i) learning and adaptation, (ii) cooperation and competition, (iii) robustness and stability, and altogether (iv) population dynamics of individual agents whose strategies evolve, the cross-fertilization of ideas between both fields will contribute to the advancement of mathematics of multi-agent learning systems, in particular, to the nascent domain of ``collective cooperative intelligence'' bridging evolutionary dynamics and multi-agent reinforcement learning.
- [1104] arXiv:2403.07022 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Unified Model for Spatio-Temporal Prediction Queries with Arbitrary Modifiable Areal UnitsComments: Accepted by ICDE 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Spatio-Temporal (ST) prediction is crucial for making informed decisions in urban location-based applications like ride-sharing. However, existing ST models often require region partition as a prerequisite, resulting in two main pitfalls. Firstly, location-based services necessitate ad-hoc regions for various purposes, requiring multiple ST models with varying scales and zones, which can be costly to support. Secondly, different ST models may produce conflicting outputs, resulting in confusing predictions. In this paper, we propose One4All-ST, a framework that can conduct ST prediction for arbitrary modifiable areal units using only one model. To reduce the cost of getting multi-scale predictions, we design an ST network with hierarchical spatial modeling and scale normalization modules to efficiently and equally learn multi-scale representations. To address prediction inconsistencies across scales, we propose a dynamic programming scheme to solve the formulated optimal combination problem, minimizing predicted error through theoretical analysis. Besides, we suggest using an extended quad-tree to index the optimal combinations for quick response to arbitrary modifiable areal units in practical online scenarios. Extensive experiments on two real-world datasets verify the efficiency and effectiveness of One4All-ST in ST prediction for arbitrary modifiable areal units. The source codes and data of this work are available at this https URL .
- [1105] arXiv:2403.07028 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: An Efficient Learning-based Solver Comparable to Metaheuristics for the Capacitated Arc Routing ProblemSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Recently, neural networks (NN) have made great strides in combinatorial optimization. However, they face challenges when solving the capacitated arc routing problem (CARP) which is to find the minimum-cost tour covering all required edges on a graph, while within capacity constraints. In tackling CARP, NN-based approaches tend to lag behind advanced metaheuristics, since they lack directed arc modeling and efficient learning methods tailored for complex CARP. In this paper, we introduce an NN-based solver to significantly narrow the gap with advanced metaheuristics while exhibiting superior efficiency. First, we propose the direction-aware attention model (DaAM) to incorporate directionality into the embedding process, facilitating more effective one-stage decision-making. Second, we design a supervised reinforcement learning scheme that involves supervised pre-training to establish a robust initial policy for subsequent reinforcement fine-tuning. It proves particularly valuable for solving CARP that has a higher complexity than the node routing problems (NRPs). Finally, a path optimization method is proposed to adjust the depot return positions within the path generated by DaAM. Experiments illustrate that our approach surpasses heuristics and achieves decision quality comparable to state-of-the-art metaheuristics for the first time while maintaining superior efficiency.
- [1106] arXiv:2403.07032 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: STARFlow: Spatial Temporal Feature Re-embedding with Attentive Learning for Real-world Scene FlowComments: 10 pages, 8 figures, CVPR templateSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Scene flow prediction is a crucial underlying task in understanding dynamic scenes as it offers fundamental motion information. However, contemporary scene flow methods encounter three major challenges. Firstly, flow estimation solely based on local receptive fields lacks long-dependency matching of point pairs. To address this issue, we propose global attentive flow embedding to match all-to-all point pairs in both feature space and Euclidean space, providing global initialization before local refinement. Secondly, there are deformations existing in non-rigid objects after warping, which leads to variations in the spatiotemporal relation between the consecutive frames. For a more precise estimation of residual flow, a spatial temporal feature re-embedding module is devised to acquire the sequence features after deformation. Furthermore, previous methods perform poor generalization due to the significant domain gap between the synthesized and LiDAR-scanned datasets. We leverage novel domain adaptive losses to effectively bridge the gap of motion inference from synthetic to real-world. Experiments demonstrate that our approach achieves state-of-the-art performance across various datasets, with particularly outstanding results on real-world LiDAR-scanned datasets. Our code is available at this https URL .
- [1107] arXiv:2403.07033 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Interpreting What Typical Fault Signals Look Like via Prototype-matchingComments: 17 pages, 12 figures, 6 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Neural networks, with powerful nonlinear mapping and classification capabilities, are widely applied in mechanical fault diagnosis to ensure safety. However, being typical black-box models, their application is limited in high-reliability-required scenarios. To understand the classification logic and explain what typical fault signals look like, the prototype matching network (PMN) is proposed by combining the human-inherent prototype-matching with autoencoder (AE). The PMN matches AE-extracted feature with each prototype and selects the most similar prototype as the prediction result. It has three interpreting paths on classification logic, fault prototypes, and matching contributions. Conventional diagnosis and domain generalization experiments demonstrate its competitive diagnostic performance and distinguished advantages in representation learning. Besides, the learned typical fault signals (i.e., sample-level prototypes) showcase the ability for denoising and extracting subtle key features that experts find challenging to capture. This ability broadens human understanding and provides a promising solution from interpretability research to AI-for-Science.
- [1108] arXiv:2403.07039 (cross-list from cs.AR) [ pdf , ps , other ]
-
Title: From English to ASIC: Hardware Implementation with Large Language ModelComments: 15 pages, 1 figureSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: In the realm of ASIC engineering, the landscape has been significantly reshaped by the rapid development of LLM, paralleled by an increase in the complexity of modern digital circuits. This complexity has escalated the requirements for HDL coding, necessitating a higher degree of precision and sophistication. However, challenges have been faced due to the less-than-optimal performance of modern language models in generating hardware description code, a situation further exacerbated by the scarcity of the corresponding high-quality code datasets. These challenges have highlighted the gap between the potential of LLMs to revolutionize digital circuit design and their current capabilities in accurately interpreting and implementing hardware specifications. To address these challenges, a strategy focusing on the fine-tuning of the leading-edge nature language model and the reshuffling of the HDL code dataset has been developed. The fine-tuning aims to enhance models' proficiency in generating precise and efficient ASIC design, while the dataset reshuffling is intended to broaden the scope and improve the quality of training material. The model demonstrated significant improvements compared to the base model, with approximately 10% to 20% increase in accuracy across a wide range of temperature for the pass@1 metric. This approach is expected to facilitate a simplified and more efficient LLM-assisted framework for complex circuit design, leveraging their capabilities to meet the sophisticated demands of HDL coding and thus streamlining the ASIC development process.
- [1109] arXiv:2403.07040 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: All in One: Multi-Task Prompting for Graph Neural Networks (Extended Abstract)Comments: submitted to IJCAI 2024 Sister Conferences Track. The original paper can be seen at arXiv:2307.01504Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper is an extended abstract of our original work published in KDD23, where we won the best research paper award (Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. KDD 23) The paper introduces a novel approach to bridging the gap between pre-trained graph models and the diverse tasks they're applied to, inspired by the success of prompt learning in NLP. Recognizing the challenge of aligning pre-trained models with varied graph tasks (node level, edge level, and graph level), which can lead to negative transfer and poor performance, we propose a multi-task prompting method for graphs. This method involves unifying graph and language prompt formats, enabling NLP's prompting strategies to be adapted for graph tasks. By analyzing the task space of graph applications, we reformulate problems to fit graph-level tasks and apply meta-learning to improve prompt initialization for multiple tasks. Experiments show our method's effectiveness in enhancing model performance across different graph tasks.
Beyond the original work, in this extended abstract, we further discuss the graph prompt from a bigger picture and provide some of the latest work toward this area. - [1110] arXiv:2403.07076 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Mapping High-level Semantic Regions in Indoor Environments without Object RecognitionComments: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2024)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph generation; less effort has been focused on the task of purely identifying and mapping large semantic regions. The present work proposes a method for semantic region mapping via embodied navigation in indoor environments, generating a high-level representation of the knowledge of the agent. To enable region identification, the method uses a vision-to-language model to provide scene information for mapping. By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location. This mapping procedure is paired with a trained navigation policy to enable autonomous map generation. The proposed method significantly outperforms a variety of baselines, including an object-based system and a pretrained scene classifier, in experiments in a photorealistic simulator.
- [1111] arXiv:2403.07078 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Improving deep learning with prior knowledge and cognitive models: A survey on enhancing explainability, adversarial robustness and zero-shot learningJournal-ref: Cognitive Systems Research, 84 (2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We review current and emerging knowledge-informed and brain-inspired cognitive systems for realizing adversarial defenses, eXplainable Artificial Intelligence (XAI), and zero-shot or few-short learning. Data-driven deep learning models have achieved remarkable performance and demonstrated capabilities surpassing human experts in many applications. Yet, their inability to exploit domain knowledge leads to serious performance limitations in practical applications. In particular, deep learning systems are exposed to adversarial attacks, which can trick them into making glaringly incorrect decisions. Moreover, complex data-driven models typically lack interpretability or explainability, i.e., their decisions cannot be understood by human subjects. Furthermore, models are usually trained on standard datasets with a closed-world assumption. Hence, they struggle to generalize to unseen cases during inference in practical open-world environments, thus, raising the zero- or few-shot generalization problem. Although many conventional solutions exist, explicit domain knowledge, brain-inspired neural network and cognitive architectures offer powerful new dimensions towards alleviating these problems. Prior knowledge is represented in appropriate forms and incorporated in deep learning frameworks to improve performance. Brain-inspired cognition methods use computational models that mimic the human mind to enhance intelligent behavior in artificial agents and autonomous robots. Ultimately, these models achieve better explainability, higher adversarial robustness and data-efficient learning, and can, in turn, provide insights for cognitive science and neuroscience-that is, to deepen human understanding on how the brain works in general, and how it handles these problems.
- [1112] arXiv:2403.07087 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: LSTM-Based Text Generation: A Study on Historical DatasetsJournal-ref: 16th International Istanbul Scientific Research Congress on Life, Engineering, Architecture, and Mathematical Sciences Proceedings Book, Pages: 42-49, 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents an exploration of Long Short-Term Memory (LSTM) networks in the realm of text generation, focusing on the utilization of historical datasets for Shakespeare and Nietzsche. LSTMs, known for their effectiveness in handling sequential data, are applied here to model complex language patterns and structures inherent in historical texts. The study demonstrates that LSTM-based models, when trained on historical datasets, can not only generate text that is linguistically rich and contextually relevant but also provide insights into the evolution of language patterns over time. The finding presents models that are highly accurate and efficient in predicting text from works of Nietzsche, with low loss values and a training time of 100 iterations. The accuracy of the model is 0.9521, indicating high accuracy. The loss of the model is 0.2518, indicating its effectiveness. The accuracy of the model in predicting text from the work of Shakespeare is 0.9125, indicating a low error rate. The training time of the model is 100, mirroring the efficiency of the Nietzsche dataset. This efficiency demonstrates the effectiveness of the model design and training methodology, especially when handling complex literary texts. This research contributes to the field of natural language processing by showcasing the versatility of LSTM networks in text generation and offering a pathway for future explorations in historical linguistics and beyond.
- [1113] arXiv:2403.07090 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Time Series Analysis of Key Societal Events as Reflected in Complex Social Media Data StreamsComments: AAAI2024 Workshop on AI for Time Series Analysis (AI4TS)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Social media platforms hold valuable insights, yet extracting essential information can be challenging. Traditional top-down approaches often struggle to capture critical signals in rapidly changing events. As global events evolve swiftly, social media narratives, including instances of disinformation, become significant sources of insights. To address the need for an inductive strategy, we explore a niche social media platform GAB and an established messaging service Telegram, to develop methodologies applicable on a broader scale. This study investigates narrative evolution on these platforms using quantitative corpus-based discourse analysis techniques. Our approach is a novel mode to study multiple social media domains to distil key information which may be obscured otherwise, allowing for useful and actionable insights. The paper details the technical and methodological aspects of gathering and preprocessing GAB and Telegram data for a keyness (Log Ratio) metric analysis, identifying crucial nouns and verbs for deeper exploration. Empirically, this approach is applied to a case study of a well defined event that had global impact: the 2023 Wagner mutiny. The main findings are: (1) the time line can be deconstructed to provide useful data features allowing for improved interpretation; (2) a methodology is applied which provides a basis for generalization. The key contribution is an approach, that in some cases, provides the ability to capture the dynamic narrative shifts over time with elevated confidence. The approach can augment near-real-time assessment of key social movements, allowing for informed governance choices. This research is important because it lays out a useful methodology for time series relevant info-culling, which can enable proactive modes for positive social engagement.
- [1114] arXiv:2403.07136 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: On the Limited Representational Power of Value Functions and its Links to Statistical (In)EfficiencySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Identifying the trade-offs between model-based and model-free methods is a central question in reinforcement learning. Value-based methods offer substantial computational advantages and are sometimes just as statistically efficient as model-based methods. However, focusing on the core problem of policy evaluation, we show information about the transition dynamics may be impossible to represent in the space of value functions. We explore this through a series of case studies focused on structures that arises in many important problems. In several, there is no information loss and value-based methods are as statistically efficient as model based ones. In other closely-related examples, information loss is severe and value-based methods are severely outperformed. A deeper investigation points to the limitations of the representational power as the driver of the inefficiency, as opposed to failure in algorithm design.
- [1115] arXiv:2403.07151 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Don't Forget What I did?: Assessing Client Contributions in Federated LearningBishwamittra Ghosh , Debabrota Basu , Fu Huazhu , Wang Yuan , Renuga Kanagavelu , Jiang Jin Peng , Liu Yong , Goh Siow Mong Rick , Wei QingsongComments: Under submissionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Federated Learning (FL) is a collaborative machine learning (ML) approach, where multiple clients participate in training an ML model without exposing the private data. Fair and accurate assessment of client contributions is an important problem in FL to facilitate incentive allocation and encouraging diverse clients to participate in a unified model training. Existing methods for assessing client contribution adopts co-operative game-theoretic concepts, such as Shapley values, but under simplified assumptions. In this paper, we propose a history-aware game-theoretic framework, called FLContrib, to assess client contributions when a subset of (potentially non-i.i.d.) clients participate in each epoch of FL training. By exploiting the FL training process and linearity of Shapley value, we develop FLContrib that yields a historical timeline of client contributions as FL training progresses over epochs. Additionally, to assess client contribution under limited computational budget, we propose a scheduling procedure that considers a two-sided fairness criteria to perform expensive Shapley value computation only in a subset of training epochs. In experiments, we demonstrate a controlled trade-off between the correctness and efficiency of client contributions assessed via FLContrib. To demonstrate the benefits of history-aware client contributions, we apply FLContrib to detect dishonest clients conducting data poisoning in FL training.
- [1116] arXiv:2403.07175 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rebuilding ROME : Resolving Model Collapse during Sequential Model EditingComments: Added explanation of failure of original implementation of ROME in the paperSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent work using Rank-One Model Editing (ROME), a popular model editing method, has shown that there are certain facts that the algorithm is unable to edit without breaking the model. Such edits have previously been called disabling edits. These disabling edits cause immediate model collapse and limits the use of ROME for sequential editing. In this paper, we show that disabling edits are an artifact of irregularities in the implementation of ROME. With this paper, we provide a more stable implementation ROME, which we call r-ROME and show that model collapse is no longer observed when making large scale sequential edits with r-ROME, while further improving generalization and locality of model editing compared to the original implementation of ROME. We also provide a detailed mathematical explanation of the reason behind disabling edits.
- [1117] arXiv:2403.07183 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer ReviewsWeixin Liang , Zachary Izzo , Yaohui Zhang , Haley Lepp , Hancheng Cao , Xuandong Zhao , Lingjiao Chen , Haotian Ye , Sheng Liu , Zhi Huang , Daniel A. McFarland , James Y. ZouComments: 42 pages, 30 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
- [1118] arXiv:2403.07191 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: $\mathbf{(N,K)}$-Puzzle: A Cost-Efficient Testbed for Benchmarking Reinforcement Learning Algorithms in Generative Language ModelComments: 8 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Recent advances in reinforcement learning (RL) algorithms aim to enhance the performance of language models at scale. Yet, there is a noticeable absence of a cost-effective and standardized testbed tailored to evaluating and comparing these algorithms. To bridge this gap, we present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ with $N$ integers. We evaluate the effectiveness of established RL algorithms such as Proximal Policy Optimization (PPO), alongside novel approaches like Identity Policy Optimization (IPO) and Direct Policy Optimization (DPO).
- [1119] arXiv:2403.07193 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: CuentosIE: can a chatbot about "tales with a message" help to teach emotional intelligence?Antonio Ferrández , Rocío Lavigne-Cerván , Jesús Peral , Ignasi Navarro-Soria , Ángel Lloret , David Gil , Carmen RocamoraComments: 26 pagesJournal-ref: PeerJ Computer Science, Volume 10, February 2024, ID e1866Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this article, we present CuentosIE (TalesEI: chatbot of tales with a message to develop Emotional Intelligence), an educational chatbot on emotions that also provides teachers and psychologists with a tool to monitor their students/patients through indicators and data compiled by CuentosIE. The use of "tales with a message" is justified by their simplicity and easy understanding, thanks to their moral or associated metaphors. The main contributions of CuentosIE are the selection, collection, and classification of a set of highly specialized tales, as well as the provision of tools (searching, reading comprehension, chatting, recommending, and classifying) that are useful for both educating users about emotions and monitoring their emotional development. The preliminary evaluation of the tool has obtained encouraging results, which provides an affirmative answer to the question posed in the title of the article.
- [1120] arXiv:2403.07194 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Improving prediction of students' performance in intelligent tutoring systems using attribute selection and ensembles of different multimodal data sourcesJournal-ref: Journal of Computing in Higher Education,2021, 33, 614-634Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: The aim of this study was to predict university students' learning performance using different sources of data from an Intelligent Tutoring System. We collected and preprocessed data from 40 students from different multimodal sources: learning strategies from system logs, emotions from face recording videos, interaction zones from eye tracking, and test performance from final knowledge evaluation. Our objective was to test whether the prediction could be improved by using attribute selection and classification ensembles. We carried out three experiments by applying six classification algorithms to numerical and discretized preprocessed multimodal data. The results show that the best predictions were produced using ensembles and selecting the best attributes approach with numerical data.
- [1121] arXiv:2403.07201 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A multi-cohort study on prediction of acute brain dysfunction states using selective state space modelsBrandon Silva , Miguel Contreras , Sabyasachi Bandyopadhyay , Yuanfang Ren , Ziyuan Guan , Jeremy Balch , Kia Khezeli , Tezcan Ozrazgat Baslanti , Ben Shickel , Azra Bihorac , Parisa RashidiComments: 22 pages, 8 figures, To be publishedSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Applications (stat.AP)
Abstract: Assessing acute brain dysfunction (ABD), including delirium and coma in the intensive care unit (ICU), is a critical challenge due to its prevalence and severe implications for patient outcomes. Current diagnostic methods rely on infrequent clinical observations, which can only determine a patient's ABD status after onset. Our research attempts to solve these problems by harnessing Electronic Health Records (EHR) data to develop automated methods for ABD prediction for patients in the ICU. Existing models solely predict a single state (e.g., either delirium or coma), require at least 24 hours of observation data to make predictions, do not dynamically predict fluctuating ABD conditions during ICU stay (typically a one-time prediction), and use small sample size, proprietary single-hospital datasets. Our research fills these gaps in the existing literature by dynamically predicting delirium, coma, and mortality for 12-hour intervals throughout an ICU stay and validating on two public datasets. Our research also introduces the concept of dynamically predicting critical transitions from non-ABD to ABD and between different ABD states in real time, which could be clinically more informative for the hospital staff. We compared the predictive performance of two state-of-the-art neural network models, the MAMBA selective state space model and the Longformer Transformer model. Using the MAMBA model, we achieved a mean area under the receiving operator characteristic curve (AUROC) of 0.95 on outcome prediction of ABD for 12-hour intervals. The model achieves a mean AUROC of 0.79 when predicting transitions between ABD states. Our study uses a curated dataset from the University of Florida Health Shands Hospital for internal validation and two publicly available datasets, MIMIC-IV and eICU, for external validation, demonstrating robustness across ICU stays from 203 hospitals and 140,945 patients.
- [1122] arXiv:2403.07230 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked PreferencesComments: Work in progressSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and rejected response pair per user prompt) to align LLMs to human preferences. In practice, multiple responses can exist for a given prompt with varying quality relative to each other. With availability of such quality ratings for multiple responses, we propose utilizing these responses to create multiple preference pairs for a given prompt. Our work focuses on systematically using the constructed multiple preference pair in DPO training via curriculum learning methodology. In particular, we order these multiple pairs of preference data from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach to the standard single-pair DPO setting. Our method, which we call Curry-DPO consistently shows increased performance gains on MTbench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-bench with Zephy-7B model outperforming majority of existing LLMs with similar parameter size. Curry-DPO also achieves the highest adjusted win rates on Vicuna, WizardLM, and UltraFeedback test datasets (90.7%, 87.1%, and 87.9% respectively) in our experiments, with notable gains of upto 7.5% when compared to standard DPO technique.
- [1123] arXiv:2403.07255 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Deep Learning-Assisted Parallel Interference Cancellation for Grant-Free NOMA in Machine-Type CommunicationSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we present a novel approach for joint activity detection (AD), channel estimation (CE), and data detection (DD) in uplink grant-free non-orthogonal multiple access (NOMA) systems. Our approach employs an iterative and parallel interference removal strategy inspired by parallel interference cancellation (PIC), enhanced with deep learning to jointly tackle the AD, CE, and DD problems. Based on this approach, we develop three PIC frameworks, each of which is designed for either coherent or non-coherence schemes. The first framework performs joint AD and CE using received pilot signals in the coherent scheme. Building upon this framework, the second framework utilizes both the received pilot and data signals for CE, further enhancing the performances of AD, CE, and DD in the coherent scheme. The third framework is designed to accommodate the non-coherent scheme involving a small number of data bits, which simultaneously performs AD and DD. Through joint loss functions and interference cancellation modules, our approach supports end-to-end training, contributing to enhanced performances of AD, CE, and DD for both coherent and non-coherent schemes. Simulation results demonstrate the superiority of our approach over traditional techniques, exhibiting enhanced performances of AD, CE, and DD while maintaining lower computational complexity.
- [1124] arXiv:2403.07261 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Disentangling Policy from Offline Task Representation Learning via Adversarial Data AugmentationChengxing Jia , Fuxiang Zhang , Yi-Chen Li , Chen-Xiao Gao , Xu-Hui Liu , Lei Yuan , Zongzhang Zhang , Yang YuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline meta-reinforcement learning (OMRL) proficiently allows an agent to tackle novel tasks while solely relying on a static dataset. For precise and efficient task identification, existing OMRL research suggests learning separate task representations that be incorporated with policy input, thus forming a context-based meta-policy. A major approach to train task representations is to adopt contrastive learning using multi-task offline data. The dataset typically encompasses interactions from various policies (i.e., the behavior policies), thus providing a plethora of contextual information regarding different tasks. Nonetheless, amassing data from a substantial number of policies is not only impractical but also often unattainable in realistic settings. Instead, we resort to a more constrained yet practical scenario, where multi-task data collection occurs with a limited number of policies. We observed that learned task representations from previous OMRL methods tend to correlate spuriously with the behavior policy instead of reflecting the essential characteristics of the task, resulting in unfavorable out-of-distribution generalization. To alleviate this issue, we introduce a novel algorithm to disentangle the impact of behavior policy from task representation learning through a process called adversarial data augmentation. Specifically, the objective of adversarial data augmentation is not merely to generate data analogous to offline data distribution; instead, it aims to create adversarial examples designed to confound learned task representations and lead to incorrect task identification. Our experiments show that learning from such adversarial samples significantly enhances the robustness and effectiveness of the task identification process and realizes satisfactory out-of-distribution generalization.
- [1125] arXiv:2403.07262 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Advantage-Aware Policy Optimization for Offline Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline Reinforcement Learning (RL) endeavors to leverage offline datasets to craft effective agent policy without online interaction, which imposes proper conservative constraints with the support of behavior policies to tackle the Out-Of-Distribution (OOD) problem. However, existing works often suffer from the constraint conflict issue when offline datasets are collected from multiple behavior policies, i.e., different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent Advantage-Weighted (AW) methods prioritize samples with high advantage values for agent training while inevitably leading to overfitting on these samples. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning under mixed-quality datasets. Specifically, A2PO employs a Conditional Variational Auto-Encoder (CVAE) to disentangle the action distributions of intertwined behavior policies by modeling the advantage values of all training data as conditional variables. Then the agent can follow such disentangled action distribution constraints to optimize the advantage-aware policy towards high advantage values. Extensive experiments conducted on both the single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to state-of-the-art counterparts. Our code will be made publicly available.
- [1126] arXiv:2403.07271 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Anderson acceleration for iteratively reweighted $\ell_1$ algorithmSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Iteratively reweighted L1 (IRL1) algorithm is a common algorithm for solving sparse optimization problems with nonconvex and nonsmooth regularization. The development of its acceleration algorithm, often employing Nesterov acceleration, has sparked significant interest. Nevertheless, the convergence and complexity analysis of these acceleration algorithms consistently poses substantial challenges. Recently, Anderson acceleration has gained prominence owing to its exceptional performance for speeding up fixed-point iteration, with numerous recent studies applying it to gradient-based algorithms. Motivated by the powerful impact of Anderson acceleration, we propose an Anderson-accelerated IRL1 algorithm and establish its local linear convergence rate. We extend this convergence result, typically observed in smooth settings, to a nonsmooth scenario. Importantly, our theoretical results do not depend on the Kurdyka-Lojasiewicz condition, a necessary condition in existing Nesterov acceleration-based algorithms. Furthermore, to ensure global convergence, we introduce a globally convergent Anderson accelerated IRL1 algorithm by incorporating a classical nonmonotone line search condition. Experimental results indicate that our algorithm outperforms existing Nesterov acceleration-based algorithms.
- [1127] arXiv:2403.07277 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Bayesian Approach to OOD Robustness in Image ClassificationComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: An important and unsolved problem in computer vision is to ensure that the algorithms are robust to changes in image domains. We address this problem in the scenario where we have access to images from the target domains but no annotations. Motivated by the challenges of the OOD-CV benchmark where we encounter real world Out-of-Domain (OOD) nuisances and occlusion, we introduce a novel Bayesian approach to OOD robustness for object classification. Our work extends Compositional Neural Networks (CompNets), which have been shown to be robust to occlusion but degrade badly when tested on OOD data. We exploit the fact that CompNets contain a generative head defined over feature vectors represented by von Mises-Fisher (vMF) kernels, which correspond roughly to object parts, and can be learned without supervision. We obverse that some vMF kernels are similar between different domains, while others are not. This enables us to learn a transitional dictionary of vMF kernels that are intermediate between the source and target domains and train the generative model on this dictionary using the annotations on the source domain, followed by iterative refinement. This approach, termed Unsupervised Generative Transition (UGT), performs very well in OOD scenarios even when occlusion is present. UGT is evaluated on different OOD benchmarks including the OOD-CV dataset, several popular datasets (e.g., ImageNet-C [9]), artificial image corruptions (including adding occluders), and synthetic-to-real domain transfer, and does well in all scenarios outperforming SOTA alternatives (e.g. up to 10% top-1 accuracy on Occluded OOD-CV dataset).
- [1128] arXiv:2403.07292 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Continual All-in-One Adverse Weather Removal with Knowledge Replay on a Unified Network StructureSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In real-world applications, image degeneration caused by adverse weather is always complex and changes with different weather conditions from days and seasons. Systems in real-world environments constantly encounter adverse weather conditions that are not previously observed. Therefore, it practically requires adverse weather removal models to continually learn from incrementally collected data reflecting various degeneration types. Existing adverse weather removal approaches, for either single or multiple adverse weathers, are mainly designed for a static learning paradigm, which assumes that the data of all types of degenerations to handle can be finely collected at one time before a single-phase learning process. They thus cannot directly handle the incremental learning requirements. To address this issue, we made the earliest effort to investigate the continual all-in-one adverse weather removal task, in a setting closer to real-world applications. Specifically, we develop a novel continual learning framework with effective knowledge replay (KR) on a unified network structure. Equipped with a principal component projection and an effective knowledge distillation mechanism, the proposed KR techniques are tailored for the all-in-one weather removal task. It considers the characteristics of the image restoration task with multiple degenerations in continual learning, and the knowledge for different degenerations can be shared and accumulated in the unified network structure. Extensive experimental results demonstrate the effectiveness of the proposed method to deal with this challenging task, which performs competitively to existing dedicated or joint training image restoration methods. Our code is available at this https URL .
- [1129] arXiv:2403.07294 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph Data Condensation via Self-expressive Graph Structure ReconstructionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: With the increasing demands of training graph neural networks (GNNs) on large-scale graphs, graph data condensation has emerged as a critical technique to relieve the storage and time costs during the training phase. It aims to condense the original large-scale graph to a much smaller synthetic graph while preserving the essential information necessary for efficiently training a downstream GNN. However, existing methods concentrate either on optimizing node features exclusively or endeavor to independently learn node features and the graph structure generator. They could not explicitly leverage the information of the original graph structure and failed to construct an interpretable graph structure for the synthetic dataset. To address these issues, we introduce a novel framework named \textbf{G}raph Data \textbf{C}ondensation via \textbf{S}elf-expressive Graph Structure \textbf{R}econstruction (\textbf{GCSR}). Our method stands out by (1) explicitly incorporating the original graph structure into the condensing process and (2) capturing the nuanced interdependencies between the condensed nodes by reconstructing an interpretable self-expressive graph structure. Extensive experiments and comprehensive analysis validate the efficacy of the proposed method across diverse GNN models and datasets. Our code is available at this https URL
- [1130] arXiv:2403.07308 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Verification-Aided Learning of Neural Network Barrier Functions with Termination GuaranteesComments: This is an online extended version of the same paper accepted to American Control Conference 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Barrier functions are a general framework for establishing a safety guarantee for a system. However, there is no general method for finding these functions. To address this shortcoming, recent approaches use self-supervised learning techniques to learn these functions using training data that are periodically generated by a verification procedure, leading to a verification-aided learning framework. Despite its immense potential in automating barrier function synthesis, the verification-aided learning framework does not have termination guarantees and may suffer from a low success rate of finding a valid barrier function in practice. In this paper, we propose a holistic approach to address these drawbacks. With a convex formulation of the barrier function synthesis, we propose to first learn an empirically well-behaved NN basis function and then apply a fine-tuning algorithm that exploits the convexity and counterexamples from the verification failure to find a valid barrier function with finite-step termination guarantees: if there exist valid barrier functions, the fine-tuning algorithm is guaranteed to find one in a finite number of iterations. We demonstrate that our fine-tuning method can significantly boost the performance of the verification-aided learning framework on examples of different scales and using various neural network verifiers.
- [1131] arXiv:2403.07309 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Reinforced Sequential Decision-Making for Sepsis Treatment: The POSNEGDM Framework with Mortality Classifier and TransformerComments: Accepted to IEEE Journal of Biomedical and Health Informatics, Mar 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Sepsis, a life-threatening condition triggered by the body's exaggerated response to infection, demands urgent intervention to prevent severe complications. Existing machine learning methods for managing sepsis struggle in offline scenarios, exhibiting suboptimal performance with survival rates below 50%. This paper introduces the POSNEGDM -- ``Reinforcement Learning with Positive and Negative Demonstrations for Sequential Decision-Making" framework utilizing an innovative transformer-based model and a feedback reinforcer to replicate expert actions while considering individual patient characteristics. A mortality classifier with 96.7\% accuracy guides treatment decisions towards positive outcomes. The POSNEGDM framework significantly improves patient survival, saving 97.39% of patients, outperforming established machine learning algorithms (Decision Transformer and Behavioral Cloning) with survival rates of 33.4% and 43.5%, respectively. Additionally, ablation studies underscore the critical role of the transformer-based decision maker and the integration of a mortality classifier in enhancing overall survival rates. In summary, our proposed approach presents a promising avenue for enhancing sepsis treatment outcomes, contributing to improved patient care and reduced healthcare costs.
- [1132] arXiv:2403.07322 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: A Question-centric Multi-experts Contrastive Learning Framework for Improving the Accuracy and Interpretability of Deep Sequential Knowledge Tracing ModelsComments: 24 pages, 8 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Knowledge tracing (KT) plays a crucial role in predicting students' future performance by analyzing their historical learning processes. Deep neural networks (DNNs) have shown great potential in solving the KT problem. However, there still exist some important challenges when applying deep learning techniques to model the KT process. The first challenge lies in taking the individual information of the question into modeling. This is crucial because, despite questions sharing the same knowledge component (KC), students' knowledge acquisition on homogeneous questions can vary significantly. The second challenge lies in interpreting the prediction results from existing deep learning-based KT models. In real-world applications, while it may not be necessary to have complete transparency and interpretability of the model parameters, it is crucial to present the model's prediction results in a manner that teachers find interpretable. This makes teachers accept the rationale behind the prediction results and utilize them to design teaching activities and tailored learning strategies for students. However, the inherent black-box nature of deep learning techniques often poses a hurdle for teachers to fully embrace the model's prediction results. To address these challenges, we propose a Question-centric Multi-experts Contrastive Learning framework for KT called Q-MCKT.
- [1133] arXiv:2403.07332 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Large Window-based Mamba UNet for Medical Image Segmentation: Beyond Convolution and Self-attentionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In clinical practice, medical image segmentation provides useful information on the contours and dimensions of target organs or tissues, facilitating improved diagnosis, analysis, and treatment. In the past few years, convolutional neural networks (CNNs) and Transformers have dominated this area, but they still suffer from either limited receptive fields or costly long-range modeling. Mamba, a State Space Sequence Model (SSM), recently emerged as a promising paradigm for long-range dependency modeling with linear complexity. In this paper, we introduce a Large Window-based Mamba U}-shape Network, or LMa-UNet, for 2D and 3D medical image segmentation. A distinguishing feature of our LMa-UNet is its utilization of large windows, excelling in locally spatial modeling compared to small kernel-based CNNs and small window-based Transformers, while maintaining superior efficiency in global modeling compared to self-attention with quadratic complexity. Additionally, we design a novel hierarchical and bidirectional Mamba block to further enhance the global and neighborhood spatial modeling capability of Mamba. Comprehensive experiments demonstrate the effectiveness and efficiency of our method and the feasibility of using large window size to achieve large receptive fields. Codes are available at this https URL .
- [1134] arXiv:2403.07342 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rethinking ASTE: A Minimalist Tagging Scheme Alongside Contrastive LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Aspect Sentiment Triplet Extraction (ASTE) is a burgeoning subtask of fine-grained sentiment analysis, aiming to extract structured sentiment triplets from unstructured textual data. Existing approaches to ASTE often complicate the task with additional structures or external data. In this research, we propose a novel tagging scheme and employ a contrastive learning approach to mitigate these challenges. The proposed approach demonstrates comparable or superior performance in comparison to state-of-the-art techniques, while featuring a more compact design and reduced computational overhead. Notably, even in the era of Large Language Models (LLMs), our method exhibits superior efficacy compared to GPT 3.5 and GPT 4 in a few-shot learning scenarios. This study also provides valuable insights for the advancement of ASTE techniques within the paradigm of large language models.
- [1135] arXiv:2403.07350 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: KEBench: A Benchmark on Knowledge Editing for Large Vision-Language ModelsComments: 13 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Currently, little research has been done on knowledge editing for Large Vision-Language Models (LVLMs). Editing LVLMs faces the challenge of effectively integrating diverse modalities (image and text) while ensuring coherent and contextually relevant modifications. An existing benchmark has three metrics (Reliability, Locality and Generality) to measure knowledge editing for LVLMs. However, the benchmark falls short in the quality of generated images used in evaluation and cannot assess whether models effectively utilize edited knowledge in relation to the associated content. We adopt different data collection methods to construct a new benchmark, $\textbf{KEBench}$, and extend new metric (Portability) for a comprehensive evaluation. Leveraging a multimodal knowledge graph, our image data exhibits clear directionality towards entities. This directional aspect can be further utilized to extract entity-related knowledge and form editing data. We conducted experiments of different editing methods on five LVLMs, and thoroughly analyze how these methods impact the models. The results reveal strengths and deficiencies of these methods and, hopefully, provide insights into potential avenues for future research.
- [1136] arXiv:2403.07355 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Vector Quantization for Deep-Learning-Based CSI Feedback in Massive MIMO SystemsSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper presents a finite-rate deep-learning (DL)-based channel state information (CSI) feedback method for massive multiple-input multiple-output (MIMO) systems. The presented method provides a finite-bit representation of the latent vector based on a vector-quantized variational autoencoder (VQ-VAE) framework while reducing its computational complexity based on shape-gain vector quantization. In this method, the magnitude of the latent vector is quantized using a non-uniform scalar codebook with a proper transformation function, while the direction of the latent vector is quantized using a trainable Grassmannian codebook. A multi-rate codebook design strategy is also developed by introducing a codeword selection rule for a nested codebook along with the design of a loss function. Simulation results demonstrate that the proposed method reduces the computational complexity associated with VQ-VAE while improving CSI reconstruction performance under a given feedback overhead.
- [1137] arXiv:2403.07362 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine UnlearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The trustworthy machine learning (ML) community is increasingly recognizing the crucial need for models capable of selectively 'unlearning' data points after training. This leads to the problem of machine unlearning (MU), aiming to eliminate the influence of chosen data points on model performance, while still maintaining the model's utility post-unlearning. Despite various MU methods for data influence erasure, evaluations have largely focused on random data forgetting, ignoring the vital inquiry into which subset should be chosen to truly gauge the authenticity of unlearning performance. To tackle this issue, we introduce a new evaluative angle for MU from an adversarial viewpoint. We propose identifying the data subset that presents the most significant challenge for influence erasure, i.e., pinpointing the worst-case forget set. Utilizing a bi-level optimization principle, we amplify unlearning challenges at the upper optimization level to emulate worst-case scenarios, while simultaneously engaging in standard training and unlearning at the lower level, achieving a balance between data influence erasure and model utility. Our proposal offers a worst-case evaluation of MU's resilience and effectiveness. Through extensive experiments across different datasets (including CIFAR-10, 100, CelebA, Tiny ImageNet, and ImageNet) and models (including both image classifiers and generative models), we expose critical pros and cons in existing (approximate) unlearning strategies. Our results illuminate the complex challenges of MU in practice, guiding the future development of more accurate and robust unlearning algorithms. The code is available at this https URL .
- [1138] arXiv:2403.07376 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled ReasoningBingqian Lin , Yunshuang Nie , Ziming Wei , Jiaqi Chen , Shikui Ma , Jianhua Han , Hang Xu , Xiaojun Chang , Xiaodan LiangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Abstract: Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. However, their predominant use in an offline manner usually suffers from substantial domain gap between the VLN task and the LLM training corpus. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision, leading to a significant mitigation of the domain gap in a cost-effective manner. Specifically, at each timestep, the LLM is prompted to forecast the navigational chain-of-thought by: 1) acting as a world model to imagine the next observation according to the instruction, 2) selecting the candidate observation that best aligns with the imagination, and 3) determining the action based on the reasoning from the prior steps. Through constructing formalized labels for training, the LLM can learn to generate desired and reasonable chain-of-thought outputs for improving the action decision. Experimental results across various training settings and popular VLN benchmarks (e.g., Room-to-Room (R2R), Room-across-Room (RxR), Room-for-Room (R4R)) show the significant superiority of NavCoT over the direct action prediction variants. Through simple parameter-efficient finetuning, our NavCoT outperforms a recent GPT4-based approach with ~7% relative improvement on the R2R dataset. We believe that NavCoT will help unlock more task-adaptive and scalable LLM-based embodied agents, which are helpful for developing real-world robotics applications. Code is available at this https URL .
- [1139] arXiv:2403.07380 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Gabor-guided transformer for single image derainingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image deraining have have gained a great deal of attention in order to address the challenges posed by the effects of harsh weather conditions on visual tasks. While convolutional neural networks (CNNs) are popular, their limitations in capturing global information may result in ineffective rain removal. Transformer-based methods with self-attention mechanisms have improved, but they tend to distort high-frequency details that are crucial for image fidelity. To solve this problem, we propose the Gabor-guided tranformer (Gabformer) for single image deraining. The focus on local texture features is enhanced by incorporating the information processed by the Gabor filter into the query vector, which also improves the robustness of the model to noise due to the properties of the filter. Extensive experiments on the benchmarks demonstrate that our method outperforms state-of-the-art approaches.
- [1140] arXiv:2403.07384 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
- [1141] arXiv:2403.07389 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Auxiliary CycleGAN-guidance for Task-Aware Domain Translation from Duplex to Monoplex IHC ImagesNicolas Brieu , Nicolas Triltsch , Philipp Wortmann , Dominik Winter , Shashank Saran , Marlon Rebelatto , Günter SchmidtComments: 4 pages, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: Generative models enable the translation from a source image domain where readily trained models are available to a target domain unseen during training. While Cycle Generative Adversarial Networks (GANs) are well established, the associated cycle consistency constrain relies on that an invertible mapping exists between the two domains. This is, however, not the case for the translation between images stained with chromogenic monoplex and duplex immunohistochemistry (IHC) assays. Focusing on the translation from the latter to the first, we propose - through the introduction of a novel training design, an alternative constrain leveraging a set of immunofluorescence (IF) images as an auxiliary unpaired image domain. Quantitative and qualitative results on a downstream segmentation task show the benefit of the proposed method in comparison to baseline approaches.
- [1142] arXiv:2403.07398 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Complex Reasoning over Logical Queries on Commonsense Knowledge GraphsComments: 19 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Event commonsense reasoning requires the ability to reason about the relationship between events, as well as infer implicit context underlying that relationship. However, data scarcity makes it challenging for language models to learn to generate commonsense inferences for contexts and questions involving interactions between complex events. To address this demand, we present COM2 (COMplex COMmonsense), a new dataset created by sampling multi-hop logical queries (e.g., the joint effect or cause of both event A and B, or the effect of the effect of event C) from an existing commonsense knowledge graph (CSKG), and verbalizing them using handcrafted rules and large language models into multiple-choice and text generation questions. Our experiments show that language models trained on COM2 exhibit significant improvements in complex reasoning ability, resulting in enhanced zero-shot performance in both in-domain and out-of-domain tasks for question answering and generative commonsense reasoning, without expensive human annotations.
- [1143] arXiv:2403.07403 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical ScenariosSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The precise recognition of food categories plays a pivotal role for intelligent health management, attracting significant research attention in recent years. Prominent benchmarks, such as Food-101 and VIREO Food-172, provide abundant food image resources that catalyze the prosperity of research in this field. Nevertheless, these datasets are well-curated from canteen scenarios and thus deviate from food appearances in daily life. This discrepancy poses great challenges in effectively transferring classifiers trained on these canteen datasets to broader daily-life scenarios encountered by humans. Toward this end, we present two new benchmarks, namely DailyFood-172 and DailyFood-16, specifically designed to curate food images from everyday meals. These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain. In addition, we also propose a simple yet effective baseline method named Multi-Cluster Reference Learning (MCRL) to tackle the aforementioned domain gap. MCRL is motivated by the observation that food images in daily-life scenarios exhibit greater intra-class appearance variance compared with those in well-curated benchmarks. Notably, MCRL can be seamlessly coupled with existing approaches, yielding non-trivial performance enhancements. We hope our new benchmarks can inspire the community to explore the transferability of food recognition models trained on well-curated datasets toward practical real-life applications.
- [1144] arXiv:2403.07404 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Accelerated Inference and Reduced Forgetting: The Dual Benefits of Early-Exit Networks in Continual LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Driven by the demand for energy-efficient employment of deep neural networks, early-exit methods have experienced a notable increase in research attention. These strategies allow for swift predictions by making decisions early in the network, thereby conserving computation time and resources. However, so far the early-exit networks have only been developed for stationary data distributions, which restricts their application in real-world scenarios with continuous non-stationary data. This study aims to explore the continual learning of the early-exit networks. We adapt existing continual learning methods to fit with early-exit architectures and investigate their behavior in the continual setting. We notice that early network layers exhibit reduced forgetting and can outperform standard networks even when using significantly fewer resources. Furthermore, we analyze the impact of task-recency bias on early-exit inference and propose Task-wise Logits Correction (TLC), a simple method that equalizes this bias and improves the network performance for every given compute budget in the class-incremental setting. We assess the accuracy and computational cost of various continual learning techniques enhanced with early-exits and TLC across standard class-incremental learning benchmarks such as 10 split CIFAR100 and ImageNetSubset and show that TLC can achieve the accuracy of the standard methods using less than 70\% of their computations. Moreover, at full computational budget, our method outperforms the accuracy of the standard counterparts by up to 15 percentage points. Our research underscores the inherent synergy between early-exit networks and continual learning, emphasizing their practical utility in resource-constrained environments.
- [1145] arXiv:2403.07440 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Matrix-Transformation Based Low-Rank Adaptation (MTLoRA): A Brain-Inspired Method for Parameter-Efficient Fine-TuningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Fine-tuning techniques based on Large Pretrained Language Models (LPLMs) have been proven to significantly enhance model performance on a variety of downstream tasks and effectively control the output behaviors of LPLMs. Recent studies have proposed numerous methods for fine-tuning a small number of parameters based on open-source LPLMs, reducing the demand for computational and storage resources. Among these, reparameterization fine-tuning methods represented by LoRA (Low-Rank Adaptation) have gained popularity. We find that although these methods perform well in many aspects, there is still considerable room for improvement in terms of complex task adaptability, performance, stability, and algorithm complexity. In response to this, inspired by the idea that the functions of the brain are shaped by its geometric structure, this paper integrates this idea into LoRA technology and proposes a new matrix transformation-based reparameterization method for efficient fine-tuning, named Matrix-Transformation based Low-Rank Adaptation (MTLoRA). MTLoRA aims to dynamically alter its spatial geometric structure by applying a transformation-matrix T to perform linear transformations, such as rotation, scaling, and translation, on the task-specific parameter matrix, generating new matrix feature patterns (eigenvectors) to mimic the fundamental influence of complex geometric structure feature patterns in the brain on functions, thereby enhancing the model's performance in downstream tasks. In Natural Language Understanding (NLU) tasks, it is evaluated using the GLUE benchmark test, and the results reveal that MTLoRA achieves an overall performance increase of about 1.0% across eight tasks; in Natural Language Generation (NLG) tasks, MTLoRA improves performance by an average of 0.95% and 0.56% in the DART and WebNLG tasks, respectively.
- [1146] arXiv:2403.07483 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Deep Learning Approach to Diabetes DiagnosisComments: Accepted to ACIIDS 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Diabetes, resulting from inadequate insulin production or utilization, causes extensive harm to the body. Existing diagnostic methods are often invasive and come with drawbacks, such as cost constraints. Although there are machine learning models like Classwise k Nearest Neighbor (CkNN) and General Regression Neural Network (GRNN), they struggle with imbalanced data and result in under-performance. Leveraging advancements in sensor technology and machine learning, we propose a non-invasive diabetes diagnosis using a Back Propagation Neural Network (BPNN) with batch normalization, incorporating data re-sampling and normalization for class balancing. Our method addresses existing challenges such as limited performance associated with traditional machine learning. Experimental results on three datasets show significant improvements in overall accuracy, sensitivity, and specificity compared to traditional methods. Notably, we achieve accuracies of 89.81% in Pima diabetes dataset, 75.49% in CDC BRFSS2015 dataset, and 95.28% in Mesra Diabetes dataset. This underscores the potential of deep learning models for robust diabetes diagnosis. See project website this https URL
- [1147] arXiv:2403.07500 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The objective of personalization and stylization in text-to-image is to instruct a pre-trained diffusion model to analyze new concepts introduced by users and incorporate them into expected styles. Recently, parameter-efficient fine-tuning (PEFT) approaches have been widely adopted to address this task and have greatly propelled the development of this field. Despite their popularity, existing efficient fine-tuning methods still struggle to achieve effective personalization and stylization in T2I generation. To address this issue, we propose block-wise Low-Rank Adaptation (LoRA) to perform fine-grained fine-tuning for different blocks of SD, which can generate images faithful to input prompts and target identity and also with desired style. Extensive experiments demonstrate the effectiveness of the proposed method.
- [1148] arXiv:2403.07540 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: WannaLaugh: A Configurable Ransomware Emulator -- Learning to Mimic Malicious Storage TracesDionysios Diamantopolous , Roman Pletka , Slavisa Sarafijanovic , A.L. Narasimha Reddy , Haris PozidisSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Ransomware, a fearsome and rapidly evolving cybersecurity threat, continues to inflict severe consequences on individuals and organizations worldwide. Traditional detection methods, reliant on static signatures and application behavioral patterns, are challenged by the dynamic nature of these threats. This paper introduces three primary contributions to address this challenge. First, we introduce a ransomware emulator. This tool is designed to safely mimic ransomware attacks without causing actual harm or spreading malware, making it a unique solution for studying ransomware behavior. Second, we demonstrate how we use this emulator to create storage I/O traces. These traces are then utilized to train machine-learning models. Our results show that these models are effective in detecting ransomware, highlighting the practical application of our emulator in developing responsible cybersecurity tools. Third, we show how our emulator can be used to mimic the I/O behavior of existing ransomware thereby enabling safe trace collection. Both the emulator and its application represent significant steps forward in ransomware detection in the era of machine-learning-driven cybersecurity.
- [1149] arXiv:2403.07553 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: The future of document indexing: GPT and Donut revolutionize table of content processingComments: Document AI, Document Classification, Information extraction, Large Language Models, OCR Models, Visual Document UnderstandingSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries.
- [1150] arXiv:2403.07559 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Ensembling Prioritized Hybrid Policies for Multi-agent PathfindingSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Multi-Agent Reinforcement Learning (MARL) based Multi-Agent Path Finding (MAPF) has recently gained attention due to its efficiency and scalability. Several MARL-MAPF methods choose to use communication to enrich the information one agent can perceive. However, existing works still struggle in structured environments with high obstacle density and a high number of agents. To further improve the performance of the communication-based MARL-MAPF solvers, we propose a new method, Ensembling Prioritized Hybrid Policies (EPH). We first propose a selective communication block to gather richer information for better agent coordination within multi-agent environments and train the model with a Q-learning-based algorithm. We further introduce three advanced inference strategies aimed at bolstering performance during the execution phase. First, we hybridize the neural policy with single-agent expert guidance for navigating conflict-free zones. Secondly, we propose Q value-based methods for prioritized resolution of conflicts as well as deadlock situations. Finally, we introduce a robust ensemble method that can efficiently collect the best out of multiple possible solutions. We empirically evaluate EPH in complex multi-agent environments and demonstrate competitive performance against state-of-the-art neural methods for MAPF.
- [1151] arXiv:2403.07573 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Towards a Dynamic Future with Adaptable Computing and Network Convergence (ACNC)Masoud Shokrnezhad , Hao Yu , Tarik Taleb , Richard Li , Kyunghan Lee , Jaeseung Song , Cedric WestphalSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Abstract: In the context of advancing 6G, a substantial paradigm shift is anticipated, highlighting comprehensive everything-to-everything interactions characterized by numerous connections and stringent adherence to Quality of Service/Experience (QoS/E) prerequisites. The imminent challenge stems from resource scarcity, prompting a deliberate transition to Computing-Network Convergence (CNC) as an auspicious approach for joint resource orchestration. While CNC-based mechanisms have garnered attention, their effectiveness in realizing future services, particularly in use cases like the Metaverse, may encounter limitations due to the continually changing nature of users, services, and resources. Hence, this paper presents the concept of Adaptable CNC (ACNC) as an autonomous Machine Learning (ML)-aided mechanism crafted for the joint orchestration of computing and network resources, catering to dynamic and voluminous user requests with stringent requirements. ACNC encompasses two primary functionalities: state recognition and context detection. Given the intricate nature of the user-service-computing-network space, the paper employs dimension reduction to generate live, holistic, abstract system states in a hierarchical structure. To address the challenges posed by dynamic changes, Continual Learning (CL) is employed, classifying the system state into contexts controlled by dedicated ML agents, enabling them to operate efficiently. These two functionalities are intricately linked within a closed loop overseen by the End-to-End (E2E) orchestrator to allocate resources. The paper introduces the components of ACNC, proposes a Metaverse scenario to exemplify ACNC's role in resource provisioning with Segment Routing v6 (SRv6), outlines ACNC's workflow, details a numerical analysis for efficiency assessment, and concludes with discussions on relevant challenges and potential avenues for future research.
- [1152] arXiv:2403.07586 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Federated Learning of Socially Appropriate Agent Behaviours in Simulated Home EnvironmentsComments: Accepted at the Workshop on Lifelong Learning and Personalization in Long-Term Human-Robot Interaction (LEAP-HRI) at the 19th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Robotics (cs.RO)
Abstract: As social robots become increasingly integrated into daily life, ensuring their behaviours align with social norms is crucial. For their widespread open-world application, it is important to explore Federated Learning (FL) settings where individual robots can learn about their unique environments while also learning from each others' experiences. In this paper, we present a novel FL benchmark that evaluates different strategies, using multi-label regression objectives, where each client individually learns to predict the social appropriateness of different robot actions while also sharing their learning with others. Furthermore, splitting the training data by different contexts such that each client incrementally learns across contexts, we present a novel Federated Continual Learning (FCL) benchmark that adapts FL-based methods to use state-of-the-art Continual Learning (CL) methods to continually learn socially appropriate agent behaviours under different contextual settings. Federated Averaging (FedAvg) of weights emerges as a robust FL strategy while rehearsal-based FCL enables incrementally learning the social appropriateness of robot actions, across contextual splits.
- [1153] arXiv:2403.07605 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In text-to-image generation, using negative prompts, which describe undesirable image characteristics, can significantly boost image quality. However, producing good negative prompts is manual and tedious. To address this, we propose NegOpt, a novel method for optimizing negative prompt generation toward enhanced image generation, using supervised fine-tuning and reinforcement learning. Our combined approach results in a substantial increase of 25% in Inception Score compared to other approaches and surpasses ground-truth negative prompts from the test set. Furthermore, with NegOpt we can preferentially optimize the metrics most important to us. Finally, we construct Negative Prompts DB, a dataset of negative prompts.
- [1154] arXiv:2403.07608 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Couler: Unified Machine Learning Workflow Optimization in CloudXiaoda Wang , Yuan Tang , Tengda Guo , Bo Sang , Jingji Wu , Jian Sha , Ke Zhang , Jiang Qian , Mingjie TangSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Machine Learning (ML) has become ubiquitous, fueling data-driven applications across various organizations. Contrary to the traditional perception of ML in research, ML workflows can be complex, resource-intensive, and time-consuming. Expanding an ML workflow to encompass a wider range of data infrastructure and data types may lead to larger workloads and increased deployment costs. Currently, numerous workflow engines are available (with over ten being widely recognized). This variety poses a challenge for end-users in terms of mastering different engine APIs. While efforts have primarily focused on optimizing ML Operations (MLOps) for a specific workflow engine, current methods largely overlook workflow optimization across different engines.
In this work, we design and implement Couler, a system designed for unified ML workflow optimization in the cloud. Our main insight lies in the ability to generate an ML workflow using natural language (NL) descriptions. We integrate Large Language Models (LLMs) into workflow generation, and provide a unified programming interface for various workflow engines. This approach alleviates the need to understand various workflow engines' APIs. Moreover, Couler enhances workflow computation efficiency by introducing automated caching at multiple stages, enabling large workflow auto-parallelization and automatic hyperparameters tuning. These enhancements minimize redundant computational costs and improve fault tolerance during deep learning workflow training. Couler is extensively deployed in real-world production scenarios at Ant Group, handling approximately 22k workflows daily, and has successfully improved the CPU/Memory utilization by more than 15% and the workflow completion rate by around 17%. - [1155] arXiv:2403.07611 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Knowledge Deletion from Trained Models through Layer-wise Partial Machine UnlearningComments: 16pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Machine unlearning has garnered significant attention due to its ability to selectively erase knowledge obtained from specific training data samples in an already trained machine learning model. This capability enables data holders to adhere strictly to data protection regulations. However, existing unlearning techniques face practical constraints, often causing performance degradation, demanding brief fine-tuning post unlearning, and requiring significant storage. In response, this paper introduces a novel class of machine unlearning algorithms. First method is partial amnesiac unlearning, integration of layer-wise pruning with amnesiac unlearning. In this method, updates made to the model during training are pruned and stored, subsequently used to forget specific data from trained model. The second method assimilates layer-wise partial-updates into label-flipping and optimization-based unlearning to mitigate the adverse effects of data deletion on model efficacy. Through a detailed experimental evaluation, we showcase the effectiveness of proposed unlearning methods. Experimental results highlight that the partial amnesiac unlearning not only preserves model efficacy but also eliminates the necessity for brief post fine-tuning, unlike conventional amnesiac unlearning. Moreover, employing layer-wise partial updates in label-flipping and optimization-based unlearning techniques demonstrates superiority in preserving model efficacy compared to their naive counterparts.
- [1156] arXiv:2403.07622 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multiple Latent Space Mapping for Compressed Dark Image EnhancementYi Zeng , Zhengning Wang , Yuxuan Liu , Tianjiao Zeng , Xuhang Liu , Xinglong Luo , Shuaicheng Liu , Shuyuan Zhu , Bing ZengSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: Dark image enhancement aims at converting dark images to normal-light images. Existing dark image enhancement methods take uncompressed dark images as inputs and achieve great performance. However, in practice, dark images are often compressed before storage or transmission over the Internet. Current methods get poor performance when processing compressed dark images. Artifacts hidden in the dark regions are amplified by current methods, which results in uncomfortable visual effects for observers. Based on this observation, this study aims at enhancing compressed dark images while avoiding compression artifacts amplification. Since texture details intertwine with compression artifacts in compressed dark images, detail enhancement and blocking artifacts suppression contradict each other in image space. Therefore, we handle the task in latent space. To this end, we propose a novel latent mapping network based on variational auto-encoder (VAE). Firstly, different from previous VAE-based methods with single-resolution features only, we exploit multiple latent spaces with multi-resolution features, to reduce the detail blur and improve image fidelity. Specifically, we train two multi-level VAEs to project compressed dark images and normal-light images into their latent spaces respectively. Secondly, we leverage a latent mapping network to transform features from compressed dark space to normal-light space. Specifically, since the degradation models of darkness and compression are different from each other, the latent mapping process is divided mapping into enlightening branch and deblocking branch. Comprehensive experiments demonstrate that the proposed method achieves state-of-the-art performance in compressed dark image enhancement.
- [1157] arXiv:2403.07630 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent weakly supervised semantic segmentation (WSSS) methods strive to incorporate contextual knowledge to improve the completeness of class activation maps (CAM). In this work, we argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics. Inspired by prototype learning theory, we propose leveraging prototype awareness to capture diverse and fine-grained feature attributes of instances. The hypothesis is that contextual prototypes might erroneously activate similar and frequently co-occurring object categories due to this knowledge bias. Therefore, we propose to enhance the prototype representation ability by mitigating the bias to better capture spatial coverage in semantic object regions. With this goal, we present a Context Prototype-Aware Learning (CPAL) strategy, which leverages semantic context to enrich instance comprehension. The core of this method is to accurately capture intra-class variations in object features through context-aware prototypes, facilitating the adaptation to the semantic attributes of various instances. We design feature distribution alignment to optimize prototype awareness, aligning instance feature distributions with dense features. In addition, a unified training framework is proposed to combine label-guided classification supervision and prototypes-guided self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show that CPAL significantly improves off-the-shelf methods and achieves state-of-the-art performance. The project is available at this https URL .
- [1158] arXiv:2403.07657 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Scalable Spatiotemporal Prediction with Bayesian Neural FieldsFeras Saad , Jacob Burnim , Colin Carroll , Brian Patton , Urs Köster , Rif A. Saurous , Matthew HoffmanComments: 22 pages, 6 figures, 3 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
Abstract: Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in many scientific and business-intelligence applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As modern datasets continue to increase in size and complexity, there is a growing need for new statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle large prediction problems. This work presents the Bayesian Neural Field (BayesNF), a domain-general statistical model for inferring rich probability distributions over a spatiotemporal domain, which can be used for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a novel deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust uncertainty quantification. By defining the prior through a sequence of smooth differentiable transforms, posterior inference is conducted on large-scale data using variationally learned surrogates trained via stochastic gradient descent. We evaluate BayesNF against prominent statistical and machine-learning baselines, showing considerable improvements on diverse prediction problems from climate and public health datasets that contain tens to hundreds of thousands of measurements. The paper is accompanied with an open-source software package ( this https URL ) that is easy-to-use and compatible with modern GPU and TPU accelerators on the JAX machine learning platform.
- [1159] arXiv:2403.07687 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation CostComments: accepted at COLING 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Current foundation models have shown impressive performance across various tasks. However, several studies have revealed that these models are not effective for everyone due to the imbalanced geographical and economic representation of the data used in the training process. Most of this data comes from Western countries, leading to poor results for underrepresented countries. To address this issue, more data needs to be collected from these countries, but the cost of annotation can be a significant bottleneck. In this paper, we propose methods to identify the data to be annotated to balance model performance and annotation costs. Our approach first involves finding the countries with images of topics (objects and actions) most visually distinct from those already in the training datasets used by current large vision-language foundation models. Next, we identify countries with higher visual similarity for these topics and show that using data from these countries to supplement the training data improves model performance and reduces annotation costs. The resulting lists of countries and corresponding topics are made available at this https URL .
- [1160] arXiv:2403.07688 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of NeuronsSimon Dufort-Labbé , Pierluca D'Oro , Evgenii Nikishin , Razvan Pascanu , Pierre-Luc Bacon , Aristide BaratinSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: When training deep neural networks, the phenomenon of $\textit{dying neurons}$ $\unicode{x2013}$units that become inactive or saturated, output zero during training$\unicode{x2013}$ has traditionally been viewed as undesirable, linked with optimization challenges, and contributing to plasticity loss in continual learning scenarios. In this paper, we reassess this phenomenon, focusing on sparsity and pruning. By systematically exploring the impact of various hyperparameter configurations on dying neurons, we unveil their potential to facilitate simple yet effective structured pruning algorithms. We introduce $\textit{Demon Pruning}$ (DemP), a method that controls the proliferation of dead neurons, dynamically leading to network sparsity. Achieved through a combination of noise injection on active units and a one-cycled schedule regularization strategy, DemP stands out for its simplicity and broad applicability. Experiments on CIFAR10 and ImageNet datasets demonstrate that DemP surpasses existing structured pruning techniques, showcasing superior accuracy-sparsity tradeoffs and training speedups. These findings suggest a novel perspective on dying neurons as a valuable resource for efficient model compression and optimization.
- [1161] arXiv:2403.07691 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ORPO: Monolithic Preference Optimization without Reference ModelComments: PreprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).
- [1162] arXiv:2403.07693 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large, Small or Both: A Novel Data Augmentation Framework Based on Language Models for Debiasing Opinion SummarizationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: As more than 70$\%$ of reviews in the existing opinion summary data set are positive, current opinion summarization approaches are reluctant to generate negative summaries given the input of negative texts. To address such sentiment bias, a direct approach without the over-reliance on a specific framework is to generate additional data based on large language models to balance the emotional distribution of the dataset. However, data augmentation based on large language models faces two disadvantages: 1) the potential issues or toxicity in the augmented data; 2) the expensive costs. Therefore, in this paper, we propose a novel data augmentation framework based on both large and small language models for debiasing opinion summarization. In specific, a small size of synthesized negative reviews is obtained by rewriting the positive text via a large language model. Then, a disentangle reconstruction model is trained based on the generated data. After training, a large amount of synthetic data can be obtained by decoding the new representation obtained from the combination of different sample representations and filtering based on confusion degree and sentiment classification. Experiments have proved that our framework can effectively alleviate emotional bias same as using only large models, but more economically.
- [1163] arXiv:2403.07704 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement LearningComments: Accepted at AAAI 2024: The 38th Annual AAAI Conference on Artificial Intelligence (Main Tech Track)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we proposed a method called Symmetric Q-learning, in which the synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution.
- [1164] arXiv:2403.07708 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Improving Reinforcement Learning from Human Feedback Using Contrastive RewardsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences. Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g. human labeling errors, making the pipeline fragile. In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named as \textit{contrastive rewards}. %Contrastive rewards Our approach involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baseline calculation and (2) a contrastive reward calculated using the baseline responses and used in the Proximal Policy Optimization (PPO) step. We show that contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We show empirically contrastive rewards can improve RLHF substantially, evaluated by both GPTs and humans, and our method consistently outperforms strong baselines.
- [1165] arXiv:2403.07711 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State SpacesComments: Accepted as a workshop paper at ICLR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features. However, attention layers are limited by their memory consumption, which increases quadratically with the length of the sequence. This limitation presents significant challenges when attempting to generate longer video sequences using diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs). SSMs have recently gained attention as viable alternatives due to their linear memory consumption relative to sequence length. In the experiments, we first evaluate our SSM-based model with UCF101, a standard benchmark of video generation. In addition, to investigate the potential of SSMs for longer video generation, we perform an experiment using the MineRL Navigate dataset, varying the number of frames to 64, 200, and 400. In these settings, our SSM-based model can considerably save memory consumption for longer sequences, while maintaining competitive FVD scores to the attention-based models. Our codes are available at this https URL .
- [1166] arXiv:2403.07718 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?Alexandre Drouin , Maxime Gasse , Massimo Caccia , Issam H. Laradji , Manuel Del Verme , Tom Marty , Léo Boisvert , Megh Thakkar , Quentin Cappart , David Vazquez , Nicolas Chapados , Alexandre LacosteComments: 27 pages, 10 figures, preprintSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
- [1167] arXiv:2403.07720 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multi-modal Auto-regressive Modeling via Visual WordsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated texts corpora, demonstrates powerful perceptual and reasoning capabilities. However, as for extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs), there lies a great difficulty that the image information is processed in the LMM as continuous visual embeddings, which cannot obtain discrete supervised labels for classification. In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. Specifically, we propose the concept of visual words, which maps the visual features to probability distributions over LLM's vocabulary, providing supervision information for visual modelling. We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
- [1168] arXiv:2403.07724 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Balancing Fairness and Accuracy in Data-Restricted Binary ClassificationZachary McBride Lazri , Danial Dervovic , Antigoni Polychroniadou , Ivan Brugere , Dana Dachman-Soled , Min WuSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
Abstract: Applications that deal with sensitive information may have restrictions placed on the data available to a machine learning (ML) classifier. For example, in some applications, a classifier may not have direct access to sensitive attributes, affecting its ability to produce accurate and fair decisions. This paper proposes a framework that models the trade-off between accuracy and fairness under four practical scenarios that dictate the type of data available for analysis. Prior works examine this trade-off by analyzing the outputs of a scoring function that has been trained to implicitly learn the underlying distribution of the feature vector, class label, and sensitive attribute of a dataset. In contrast, our framework directly analyzes the behavior of the optimal Bayesian classifier on this underlying distribution by constructing a discrete approximation it from the dataset itself. This approach enables us to formulate multiple convex optimization problems, which allow us to answer the question: How is the accuracy of a Bayesian classifier affected in different data restricting scenarios when constrained to be fair? Analysis is performed on a set of fairness definitions that include group and individual fairness. Experiments on three datasets demonstrate the utility of the proposed framework as a tool for quantifying the trade-offs among different fairness notions and their distributional dependencies.
- [1169] arXiv:2403.07733 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DSEG-LIME -- Improving Image Explanation by Hierarchical Data-Driven SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Explainable Artificial Intelligence is critical in unraveling decision-making processes in complex machine learning models. LIME (Local Interpretable Model-agnostic Explanations) is a well-known XAI framework for image analysis. It utilizes image segmentation to create features to identify relevant areas for classification. Consequently, poor segmentation can compromise the consistency of the explanation and undermine the importance of the segments, affecting the overall interpretability. Addressing these challenges, we introduce DSEG-LIME (Data-Driven Segmentation LIME), featuring: i) a data-driven segmentation for human-recognized feature generation, and ii) a hierarchical segmentation procedure through composition. We benchmark DSEG-LIME on pre-trained models with images from the ImageNet dataset - scenarios without domain-specific knowledge. The analysis includes a quantitative evaluation using established XAI metrics, complemented by a qualitative assessment through a user study. Our findings demonstrate that DSEG outperforms in most of the XAI metrics and enhances the alignment of explanations with human-recognized concepts, significantly improving interpretability. The code is available under: https://github. com/patrick-knab/DSEG-LIME
- [1170] arXiv:2403.07741 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Uncertainty Quantification with Deep Ensembles for 6D Object Pose EstimationComments: 8 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The estimation of 6D object poses is a fundamental task in many computer vision applications. Particularly, in high risk scenarios such as human-robot interaction, industrial inspection, and automation, reliable pose estimates are crucial. In the last years, increasingly accurate and robust deep-learning-based approaches for 6D object pose estimation have been proposed. Many top-performing methods are not end-to-end trainable but consist of multiple stages. In the context of deep uncertainty quantification, deep ensembles are considered as state of the art since they have been proven to produce well-calibrated and robust uncertainty estimates. However, deep ensembles can only be applied to methods that can be trained end-to-end. In this work, we propose a method to quantify the uncertainty of multi-stage 6D object pose estimation approaches with deep ensembles. For the implementation, we choose SurfEmb as representative, since it is one of the top-performing 6D object pose estimation approaches in the BOP Challenge 2022. We apply established metrics and concepts for deep uncertainty quantification to evaluate the results. Furthermore, we propose a novel uncertainty calibration score for regression tasks to quantify the quality of the estimated uncertainty.
- [1171] arXiv:2403.07743 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Equipping Computational Pathology Systems with Artifact Processing Pipelines: A Showcase for Computation and Performance Trade-offsNeel Kanwal , Farbod Khoraminia , Umay Kiraz , Andres Mosquera-Zamudio , Carlos Monteagudo , Emiel A.M. Janssen , Tahlita C.M. Zuiverloon , Chunmig Rong , Kjersti EnganComments: Submitted to BMC Medical Informatics and Decision Making JournalSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Histopathology is a gold standard for cancer diagnosis under a microscopic examination. However, histological tissue processing procedures result in artifacts, which are ultimately transferred to the digitized version of glass slides, known as whole slide images (WSIs). Artifacts are diagnostically irrelevant areas and may result in wrong deep learning (DL) algorithms predictions. Therefore, detecting and excluding artifacts in the computational pathology (CPATH) system is essential for reliable automated diagnosis. In this paper, we propose a mixture of experts (MoE) scheme for detecting five notable artifacts, including damaged tissue, blur, folded tissue, air bubbles, and histologically irrelevant blood from WSIs. First, we train independent binary DL models as experts to capture particular artifact morphology. Then, we ensemble their predictions using a fusion mechanism. We apply probabilistic thresholding over the final probability distribution to improve the sensitivity of the MoE. We developed DL pipelines using two MoEs and two multiclass models of state-of-the-art deep convolutional neural networks (DCNNs) and vision transformers (ViTs). DCNNs-based MoE and ViTs-based MoE schemes outperformed simpler multiclass models and were tested on datasets from different hospitals and cancer types, where MoE using DCNNs yielded the best results. The proposed MoE yields 86.15% F1 and 97.93% sensitivity scores on unseen data, retaining less computational cost for inference than MoE using ViTs. This best performance of MoEs comes with relatively higher computational trade-offs than multiclass models. The proposed artifact detection pipeline will not only ensure reliable CPATH predictions but may also provide quality control.
- [1172] arXiv:2403.07745 (cross-list from stat.ML) [ pdf , ps , other ]
-
Title: Probabilistic Easy Variational Causal EffectComments: 45 pages, 9 FiguresSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Let $X$ and $Z$ be random vectors, and $Y=g(X,Z)$. In this paper, on the one hand, for the case that $X$ and $Z$ are continuous, by using the ideas from the total variation and the flux of $g$, we develop a point of view in causal inference capable of dealing with a broad domain of causal problems. Indeed, we focus on a function, called Probabilistic Easy Variational Causal Effect (PEACE), which can measure the direct causal effect of $X$ on $Y$ with respect to continuously and interventionally changing the values of $X$ while keeping the value of $Z$ constant. PEACE is a function of $d\ge 0$, which is a degree managing the strengths of probability density values $f(x|z)$. On the other hand, we generalize the above idea for the discrete case and show its compatibility with the continuous case. Further, we investigate some properties of PEACE using measure theoretical concepts. Furthermore, we provide some identifiability criteria and several examples showing the generic capability of PEACE. We note that PEACE can deal with the causal problems for which micro-level or just macro-level changes in the value of the input variables are important. Finally, PEACE is stable under small changes in $\partial g_{in}/\partial x$ and the joint distribution of $X$ and $Z$, where $g_{in}$ is obtained from $g$ by removing all functional relationships defining $X$ and $Z$.
- [1173] arXiv:2403.07747 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this paper, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of mathematical reasoning abilities of LLMs. All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvements in terms of mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis on the evaluation process and methods that have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.
- [1174] arXiv:2403.07748 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Ariadne and Theseus: Exploration and Rendezvous with Two Mobile Agents in an Unknown GraphSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI)
Abstract: We investigate two fundamental problems in mobile computing: exploration and rendezvous, with two distinct mobile agents in an unknown graph. The agents can read and write information on whiteboards that are located at all nodes. They both move along one adjacent edge at every time-step. In the exploration problem, both agents start from the same node of the graph and must traverse all of its edges. We show that a simple variant of depth-first search achieves collective exploration in $m$ synchronous time-steps, where $m$ is the number of edges of the graph. This improves the competitive ratio of collective graph exploration. In the rendezvous problem, the agents start from different nodes of the graph and must meet as fast as possible. We introduce an algorithm guaranteeing rendezvous in at most $\frac{3}{2}m$ time-steps. This improves over the so-called `wait for Mommy' algorithm which requires $2m$ time-steps. All our guarantees are derived from a more general asynchronous setting in which the speeds of the agents are controlled by an adversary at all times. Our guarantees also generalize to weighted graphs, if the number of edges $m$ is replaced by the sum of all edge lengths.
- [1175] arXiv:2403.07750 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image EmbeddingsSahand Sharifzadeh , Christos Kaplanis , Shreya Pathak , Dharshan Kumaran , Anastasija Ilic , Jovana Mitrovic , Charles Blundell , Andrea BaninoComments: 9 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs pretraining a text-to-image model to synthesize image embeddings starting from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that the VLM trained with synthetic data exhibits comparable performance on image captioning, while requiring a fraction of the data used by models trained solely on human-annotated data. In particular, we outperform the baseline by 17% through augmentation with a synthetic dataset. Furthermore, we show that synthesizing in the image embedding space is 25% faster than in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, all with improved data efficiency and resource utilization.
- [1176] arXiv:2403.07788 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: DexCap: Scalable and Portable Mocap Data Collection System for Dexterous ManipulationSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Imitation learning from human hand motion data presents a promising avenue for imbuing robots with human-like dexterity in real-world manipulation tasks. Despite this potential, substantial challenges persist, particularly with the portability of existing hand motion capture (mocap) systems and the difficulty of translating mocap data into effective control policies. To tackle these issues, we introduce DexCap, a portable hand motion capture system, alongside DexIL, a novel imitation algorithm for training dexterous robot skills directly from human hand mocap data. DexCap offers precise, occlusion-resistant tracking of wrist and finger motions based on SLAM and electromagnetic field together with 3D observations of the environment. Utilizing this rich dataset, DexIL employs inverse kinematics and point cloud-based imitation learning to replicate human actions with robot hands. Beyond learning from human motion, DexCap also offers an optional human-in-the-loop correction mechanism to refine and further improve robot performance. Through extensive evaluation across six dexterous manipulation tasks, our approach not only demonstrates superior performance but also showcases the system's capability to effectively learn from in-the-wild mocap data, paving the way for future data collection methods for dexterous manipulation. More details can be found at this https URL
- [1177] arXiv:2403.07797 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Joint Selection: Adaptively Incorporating Public Information for Private Synthetic DataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non publicly assisted synthetic data generation mechanisms even when the public data distribution is biased.
- [1178] arXiv:2403.07805 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Beyond Memorization: The Challenge of Random Memory Access in Language ModelsComments: 8 pages, 4 figures; fixed typosSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent developments in Language Models (LMs) have shown their effectiveness in NLP tasks, particularly in knowledge-intensive tasks. However, the mechanisms underlying knowledge storage and memory access within their parameters remain elusive. In this paper, we investigate whether a generative LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through carefully-designed synthetic tasks, covering the scenarios of full recitation, selective recitation and grounded question answering, we reveal that LMs manage to sequentially access their memory while encountering challenges in randomly accessing memorized content. We find that techniques including recitation and permutation improve the random memory access capability of LMs. Furthermore, by applying this intervention to realistic scenarios of open-domain question answering, we validate that enhancing random access by recitation leads to notable improvements in question answering. The code to reproduce our experiments can be found at this https URL .
- [1179] arXiv:2403.07815 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Chronos: Learning the Language of Time SeriesAbdul Fatir Ansari , Lorenzo Stella , Caner Turkmen , Xiyuan Zhang , Pedro Mercado , Huibin Shen , Oleksandr Shchur , Syama Sundar Rangapuram , Sebastian Pineda Arango , Shubham Kapoor , Jasper Zschiegner , Danielle C. Maddix , Hao Wang , Michael W. Mahoney , Kari Torkkola , Andrew Gordon Wilson , Michael Bohlke-Schneider , Yuyang WangComments: Code and model checkpoints available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines.
- [1180] arXiv:2403.07816 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLMSainbayar Sukhbaatar , Olga Golovneva , Vasu Sharma , Hu Xu , Xi Victoria Lin , Baptiste Rozière , Jacob Kahn , Daniel Li , Wen-tau Yih , Jason Weston , Xian LiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
- [1181] arXiv:2403.07818 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Label Dropout: Improved Deep Learning Echocardiography Segmentation Using Multiple Datasets With Domain Shift and Partial LabellingIman Islam (1), Esther Puyol-Antón (1), Bram Ruijsink (1), Andrew J. Reader (1), Andrew P. King (1) ((1) King's College London)Comments: 10 pages, 5 figures, submitted to MICCAI conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Echocardiography (echo) is the first imaging modality used when assessing cardiac function. The measurement of functional biomarkers from echo relies upon the segmentation of cardiac structures and deep learning models have been proposed to automate the segmentation process. However, in order to translate these tools to widespread clinical use it is important that the segmentation models are robust to a wide variety of images (e.g. acquired from different scanners, by operators with different levels of expertise etc.). To achieve this level of robustness it is necessary that the models are trained with multiple diverse datasets. A significant challenge faced when training with multiple diverse datasets is the variation in label presence, i.e. the combined data are often partially-labelled. Adaptations of the cross entropy loss function have been proposed to deal with partially labelled data. In this paper we show that training naively with such a loss function and multiple diverse datasets can lead to a form of shortcut learning, where the model associates label presence with domain characteristics, leading to a drop in performance. To address this problem, we propose a novel label dropout scheme to break the link between domain characteristics and the presence or absence of labels. We demonstrate that label dropout improves echo segmentation Dice score by 62% and 25% on two cardiac structures when training using multiple diverse partially labelled datasets.
- [1182] arXiv:2403.07839 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error MetricComments: 18 pages, 8 figures, Published in CVPR2024Journal-ref: In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Vision-language pre-trained models have achieved impressive performance on various downstream tasks. However, their large model sizes hinder their utilization on platforms with limited computational resources. We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance. Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks. In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, accurately assessing CLIP module importance by performance decline on cross-modal tasks. Using the MoPE metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
- [1183] arXiv:2403.07865 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Exploring Safety Generalization Challenges of Large Language Models via CodeSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Abstract: The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural languages, which may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs including GPT-4, Claude-2, and Llama-2 series reveal a common safety vulnerability of these models against code input: CodeAttack bypasses the safety guardrails of all models more than 80% of the time. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization, such as encoding natural language input with data structures. Furthermore, we give two hypotheses about the success of CodeAttack: (1) the misaligned bias acquired by LLMs during code training, prioritizing code completion over avoiding the potential safety risk; (2) the limited self-evaluation capability regarding the safety of their code outputs. Finally, we analyze potential mitigation measures. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.
- [1184] arXiv:2403.07869 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: TeleMoMa: A Modular and Versatile Teleoperation System for Mobile ManipulationShivin Dass , Wensi Ai , Yuqian Jiang , Samik Singh , Jiaheng Hu , Ruohan Zhang , Peter Stone , Ben Abbatematteo , Roberto Martín-MartínSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: A critical bottleneck limiting imitation learning in robotics is the lack of data. This problem is more severe in mobile manipulation, where collecting demonstrations is harder than in stationary manipulation due to the lack of available and easy-to-use teleoperation interfaces. In this work, we demonstrate TeleMoMa, a general and modular interface for whole-body teleoperation of mobile manipulators. TeleMoMa unifies multiple human interfaces including RGB and depth cameras, virtual reality controllers, keyboard, joysticks, etc., and any combination thereof. In its more accessible version, TeleMoMa works using simply vision (e.g., an RGB-D camera), lowering the entry bar for humans to provide mobile manipulation demonstrations. We demonstrate the versatility of TeleMoMa by teleoperating several existing mobile manipulators - PAL Tiago++, Toyota HSR, and Fetch - in simulation and the real world. We demonstrate the quality of the demonstrations collected with TeleMoMa by training imitation learning policies for mobile manipulation tasks involving synchronized whole-body motion. Finally, we also show that TeleMoMa's teleoperation channel enables teleoperation on site, looking at the robot, or remote, sending commands and observations through a computer network, and perform user studies to evaluate how easy it is for novice users to learn to collect demonstrations with different combinations of human interfaces enabled by our system. We hope TeleMoMa becomes a helpful tool for the community enabling researchers to collect whole-body mobile manipulation demonstrations. For more information and video results, this https URL .
- [1185] arXiv:2403.07879 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: AI incidents and 'networked trouble': The case for a research agendaJournal-ref: Big Data & Society, 2023, July - December 1 - 6Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Against a backdrop of widespread interest in how publics can participate in the design of AI, I argue for a research agenda focused on AI incidents - examples of AI going wrong and sparking controversy - and how they are constructed in online environments. I take up the example of an AI incident from September 2020, when a Twitter user created a 'horrible experiment' to demonstrate the racist bias of Twitter's algorithm for cropping images. This resulted in Twitter not only abandoning its use of that algorithm, but also disavowing its decision to use any algorithm for the task. I argue that AI incidents like this are a significant means for participating in AI systems that require further research. That research agenda, I argue, should focus on how incidents are constructed through networked online behaviours that I refer to as 'networked trouble', where formats for participation enable individuals and algorithms to interact in ways that others - including technology companies - come to know and come to care about. At stake, I argue, is an important mechanism for participating in the design and deployment of AI.
- [1186] arXiv:2403.07883 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch SelectionWei Ye , Chaoya Jiang , Haiyang Xu , Chenhao Ye , Chenliang Li , Ming Yan , Shikun Zhang , Songhang Huang , Fei HuangSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational inefficiencies caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference processes. This patch-selection layer dynamically computes text-dependent visual attention, enabling it to identify attentive image tokens with text guidance and fuse inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely-used multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
- [1187] arXiv:2403.07884 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Seg-metrics: a Python package to compute segmentation metricsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In response to a concerning trend of selectively emphasizing metrics in medical image segmentation (MIS) studies, we introduce \texttt{seg-metrics}, an open-source Python package for standardized MIS model evaluation. Unlike existing packages, \texttt{seg-metrics} offers user-friendly interfaces for various overlap-based and distance-based metrics, providing a comprehensive solution. \texttt{seg-metrics} supports multiple file formats and is easily installable through the Python Package Index (PyPI). With a focus on speed and convenience, \texttt{seg-metrics} stands as a valuable tool for efficient MIS model assessment.
- [1188] arXiv:2403.07885 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MOD-CL: Multi-label Object Detection with Constrained LossSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We introduce MOD-CL, a multi-label object detection framework that utilizes constrained loss in the training process to produce outputs that better satisfy the given requirements. In this paper, we use $\mathrm{MOD_{YOLO}}$, a multi-label object detection model built upon the state-of-the-art object detection model YOLOv8, which has been published in recent years. In Task 1, we introduce the Corrector Model and Blender Model, two new models that follow after the object detection process, aiming to generate a more constrained output. For Task 2, constrained losses have been incorporated into the $\mathrm{MOD_{YOLO}}$ architecture using Product T-Norm. The results show that these implementations are instrumental to improving the scores for both Task 1 and Task 2.
- [1189] arXiv:2403.07886 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: A Memetic Algorithm To Find a Hamiltonian Cycle in a Hamiltonian GraphSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Abstract: We present a memetic algorithm (\maa) approach for finding a Hamiltonian cycle in a Hamiltonian graph. The \ma is based on a proven approach to the Asymmetric Travelling Salesman Problem (\atspp) that, in this contribution, is boosted by the introduction of more powerful local searches. Our approach also introduces a novel technique that sparsifies the input graph under consideration for Hamiltonicity and dynamically augments it during the search. Such a combined heuristic approach helps to prove Hamiltonicity by finding a Hamiltonian cycle in less time. In addition, we also employ a recently introduced polynomial-time reduction from the \hamcyc to the Symmetric \tsp, which is based on computing the transitive closure of the graph. Although our approach is a metaheuristic, i.e., it does not give a theoretical guarantee for finding a Hamiltonian cycle, we have observed that the method is successful in practice in verifying the Hamiltonicity of a larger number of instances from the \textit{Flinder University Hamiltonian Cycle Problem Challenge Set} (\fhcpsc), even for the graphs that have large treewidth. The experiments on the \fhcpscc instances and a computational comparison with five recent state-of-the-art baseline approaches show that the proposed method outperforms those for the majority of the instances in the \fhcpsc.
- [1190] arXiv:2403.07887 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot RepresentationsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Object-centric methods have seen significant progress in unsupervised decomposition of raw perception into rich object-like abstractions. However, limited ability to ground object semantics of the real world into the learned abstractions has hindered their adoption in downstream understanding applications. We present the Neural Slot Interpreter (NSI) that learns to ground and generate object semantics via slot representations. At the core of NSI is an XML-like programming language that uses simple syntax rules to organize the object semantics of a scene into object-centric program primitives. Then, an alignment model learns to ground program primitives into slots through a bi-level contrastive learning objective over a shared embedding space. Finally, we formulate the NSI program generator model to use the dense associations inferred from the alignment model to generate object-centric programs from slots. Experiments on bi-modal retrieval tasks demonstrate the efficacy of the learned alignments, surpassing set-matching-based predictors by a significant margin. Moreover, learning the program generator from grounded associations enhances the predictive power of slots. NSI generated programs demonstrate improved performance of object-centric learners on property prediction and object detection, and scale with real-world scene complexity.
- [1191] arXiv:2403.07888 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Cross-modality debiasing: using language to mitigate sub-population shifts in imagingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Sub-population shift is a specific type of domain shift that highlights changes in data distribution within specific sub-groups or populations between training and testing. Sub-population shift accounts for a significant source of algorithmic bias and calls for distributional robustness. Recent studies found inherent distributional robustness in multi-modality foundation models, such as the vision-language model CLIP, yet this robustness is vulnerable through parameter fine-tuning. In this paper, we propose leveraging the connection of robustness among different modalities and reshaping the distributional robustness of one modality with another. Specifically, in the context of the distributional robustness of CLIP, we propose to leverage natural language inputs to debias the image feature representations, to improve worst-case performance on sub-populations. Our extensive empirical studies show that image representations debiased by natural language can achieve significant performance improvement and reduction of performance instability under sub-population shifts.
- [1192] arXiv:2403.07890 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: $\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov GamesSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analogous convergence results are scarce in Markov games, a more generic setting that lays the foundation for multi-agent reinforcement learning. In this work, we close this gap by showing that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with appropriate value update procedures, can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations. Numerical results are also included to corroborate our theoretical findings.
- [1193] arXiv:2403.07904 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Addressing the Regulatory Gap: Moving Towards an EU AI Audit Ecosystem Beyond the AIA by Including Civil SocietySubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The European legislature has proposed the Digital Services Act (DSA) and Artificial Intelligence Act (AIA) to regulate platforms and Artificial Intelligence (AI) products. We review to what extent third-party audits are part of both laws and to what extent access to models and data is provided. By considering the value of third-party audits and third-party data access in an audit ecosystem, we identify a regulatory gap in that the Artificial Intelligence Act does not provide access to data for researchers and civil society. Our contributions to the literature include: (1) Defining an AI audit ecosystem that incorporates compliance and oversight. (2) Highlighting a regulatory gap within the DSA and AIA regulatory framework, preventing the establishment of an AI audit ecosystem. (3) Emphasizing that third-party audits by research and civil society must be part of that ecosystem and demand that the AIA include data and model access for certain AI products. We call for the DSA to provide NGOs and investigative journalists with data access to platforms by delegated acts and for adaptions and amendments of the AIA to provide third-party audits and data and model access at least for high-risk systems to close the regulatory gap. Regulations modeled after European Union AI regulations should enable data access and third-party audits, fostering an AI audit ecosystem that promotes compliance and oversight mechanisms.
- [1194] arXiv:2403.07905 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Enhancing Kubernetes Automated Scheduling with Deep Learning and Reinforcement Techniques for Large-Scale Cloud Computing OptimizationSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: With the continuous expansion of the scale of cloud computing applications, artificial intelligence technologies such as Deep Learning and Reinforcement Learning have gradually become the key tools to solve the automated task scheduling of large-scale cloud computing systems. Aiming at the complexity and real-time requirement of task scheduling in large-scale cloud computing system, this paper proposes an automatic task scheduling scheme based on deep learning and reinforcement learning. Firstly, the deep learning technology is used to monitor and predict the parameters in the cloud computing system in real time to obtain the system status information. Then, combined with reinforcement learning algorithm, the task scheduling strategy is dynamically adjusted according to the real-time system state and task characteristics to achieve the optimal utilization of system resources and the maximum of task execution efficiency. This paper verifies the effectiveness and performance advantages of the proposed scheme in experiments, and proves the potential and application prospect of deep learning and reinforcement learning in automatic task scheduling in large-scale cloud computing systems.
- [1195] arXiv:2403.07911 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Standing on FURM ground -- A framework for evaluating Fair, Useful, and Reliable AI Models in healthcare systemsAlison Callahan , Duncan McElfresh , Juan M. Banda , Gabrielle Bunney , Danton Char , Jonathan Chen , Conor K. Corbin , Debadutta Dash , Norman L. Downing , Sneha S. Jain , Nikesh Kotecha , Jonathan Masterson , Michelle M. Mello , Keith Morse , Srikar Nallan , Abby Pandya , Anurang Revri , Aditya Sharma , Christopher Sharp , Rahul Thapa , Michael Wornow , Alaa Youssef , Michael A. Pfeffer , Nigam H. ShahSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The impact of using artificial intelligence (AI) to guide patient care or operational processes is an interplay of the AI model's output, the decision-making protocol based on that output, and the capacity of the stakeholders involved to take the necessary subsequent action. Estimating the effects of this interplay before deployment, and studying it in real time afterwards, are essential to bridge the chasm between AI model development and achievable benefit. To accomplish this, the Data Science team at Stanford Health Care has developed a Testing and Evaluation (T&E) mechanism to identify fair, useful and reliable AI models (FURM) by conducting an ethical review to identify potential value mismatches, simulations to estimate usefulness, financial projections to assess sustainability, as well as analyses to determine IT feasibility, design a deployment strategy, and recommend a prospective monitoring and evaluation plan. We report on FURM assessments done to evaluate six AI guided solutions for potential adoption, spanning clinical and operational settings, each with the potential to impact from several dozen to tens of thousands of patients each year. We describe the assessment process, summarize the six assessments, and share our framework to enable others to conduct similar assessments. Of the six solutions we assessed, two have moved into a planning and implementation phase. Our novel contributions - usefulness estimates by simulation, financial projections to quantify sustainability, and a process to do ethical assessments - as well as their underlying methods and open source tools, are available for other healthcare systems to conduct actionable evaluations of candidate AI solutions.
- [1196] arXiv:2403.07918 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: On the Societal Impact of Open Foundation ModelsSayash Kapoor , Rishi Bommasani , Kevin Klyman , Shayne Longpre , Ashwin Ramaswami , Peter Cihon , Aspen Hopkins , Kevin Bankston , Stella Biderman , Miranda Bogen , Rumman Chowdhury , Alex Engler , Peter Henderson , Yacine Jernite , Seth Lazar , Stefano Maffulli , Alondra Nelson , Joelle Pineau , Aviya Skowron , Dawn Song , Victor Storchan , Daniel Zhang , Daniel E. Ho , Percy Liang , Arvind NarayananSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to both their benefits and risks. Open foundation models present significant benefits, with some caveats, that span innovation, competition, the distribution of decision-making power, and transparency. To understand their risks of misuse, we design a risk assessment framework for analyzing their marginal risk. Across several misuse vectors (e.g. cyberattacks, bioweapons), we find that current research is insufficient to effectively characterize the marginal risk of open foundation models relative to pre-existing technologies. The framework helps explain why the marginal risk is low in some cases, clarifies disagreements about misuse risks by revealing that past work has focused on different subsets of the framework with different assumptions, and articulates a way forward for more constructive debate. Overall, our work helps support a more grounded assessment of the societal impact of open foundation models by outlining what research is needed to empirically validate their theoretical benefits and risks.
- [1197] arXiv:2403.07920 (cross-list from q-bio.BM) [ pdf , ps , html , other ]
-
Title: ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-TrainingLe Zhuo , Zewen Chi , Minghao Xu , Heyan Huang , Heqi Zheng , Conghui He , Xian-Ling Mao , Wentao ZhangComments: this https URLSubjects: Biomolecules (q-bio.BM) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We propose ProtLLM, a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. Besides, we propose the protein-as-word language modeling approach to train ProtLLM. By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates. Additionally, we construct a large-scale interleaved protein-text dataset, named InterPT, for pre-training. This dataset comprehensively encompasses both (1) structured data sources like protein annotations and (2) unstructured data sources like biological research papers, thereby endowing ProtLLM with crucial knowledge for understanding proteins. We evaluate ProtLLM on classic supervised protein-centric tasks and explore its novel protein-language applications. Experimental results demonstrate that ProtLLM not only achieves superior performance against protein-specialized baselines on protein-centric tasks but also induces zero-shot and in-context learning capabilities on protein-language tasks.
- [1198] arXiv:2403.07921 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Merino: Entropy-driven Design for Generative Language Models on IoT DevicesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, directly deploying LLMs in resource-constrained hardware, such as Internet-of-Things (IoT) devices, is difficult due to their high computational cost. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. Our key design paradigm is to maximize the entropy of transformer decoders within the given computational budgets. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across nine NLP downstream tasks, showing their competitive performance against the state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better zero performance compared to the 350M parameter OPT while being 4.9x faster on NVIDIA Jetson Nano with 5.5x reduction in model size. Code will be made available soon.
- [1199] arXiv:2403.07923 (cross-list from cs.NI) [ pdf , ps , other ]
-
Title: The Fusion of Deep Reinforcement Learning and Edge Computing for Real-time Monitoring and Control Optimization in IoT EnvironmentsSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Abstract: In response to the demand for real-time performance and control quality in industrial Internet of Things (IoT) environments, this paper proposes an optimization control system based on deep reinforcement learning and edge computing. The system leverages cloud-edge collaboration, deploys lightweight policy networks at the edge, predicts system states, and outputs controls at a high frequency, enabling monitoring and optimization of industrial objectives. Additionally, a dynamic resource allocation mechanism is designed to ensure rational scheduling of edge computing resources, achieving global optimization. Results demonstrate that this approach reduces cloud-edge communication latency, accelerates response to abnormal situations, reduces system failure rates, extends average equipment operating time, and saves costs for manual maintenance and replacement. This ensures real-time and stable control.
- [1200] arXiv:2403.07924 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: AI and IdentityComments: 10 pages, 4 figures, AAAI Spring SymposiumSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: AI-empowered technologies' impact on the world is undeniable, reshaping industries, revolutionizing how humans interact with technology, transforming educational paradigms, and redefining social codes. However, this rapid growth is accompanied by two notable challenges: a lack of diversity within the AI field and a widening AI divide. In this context, This paper examines the intersection of AI and identity as a pathway to understand biases, inequalities, and ethical considerations in AI development and deployment. We present a multifaceted definition of AI identity, which encompasses its creators, applications, and their broader impacts. Understanding AI's identity involves understanding the associations between the individuals involved in AI's development, the technologies produced, and the social, ethical, and psychological implications. After exploring the AI identity ecosystem and its societal dynamics, We propose a framework that highlights the need for diversity in AI across three dimensions: Creators, Creations, and Consequences through the lens of identity. This paper proposes the need for a comprehensive approach to fostering a more inclusive and responsible AI ecosystem through the lens of identity.
- [1201] arXiv:2403.07932 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Feint in Multi-Player GamesSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces the first formalization, implementation and quantitative evaluation of Feint in Multi-Player Games. Our work first formalizes Feint from the perspective of Multi-Player Games, in terms of the temporal, spatial, and their collective impacts. The formalization is built upon Non-transitive Active Markov Game Model, where Feint can have a considerable amount of impacts. Then, our work considers practical implementation details of Feint in Multi-Player Games, under the state-of-the-art progress of multi-agent modeling to date (namely Multi-Agent Reinforcement Learning). Finally, our work quantitatively examines the effectiveness of our design, and the results show that our design of Feint can (1) greatly improve the reward gains from the game; (2) significantly improve the diversity of Multi-Player Games; and (3) only incur negligible overheads in terms of time consumption. We conclude that our design of Feint is effective and practical, to make Multi-Player Games more interesting.
- [1202] arXiv:2403.07938 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Text-to-Audio Generation Synchronized with VideosComments: arXiv admin note: text overlap with arXiv:2305.12903Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency. To complement this, we also present a simple yet effective video-aligned TTA generation model, namely T2AV. Moving beyond traditional methods, T2AV refines the latent diffusion approach by integrating visual-aligned text embeddings as its conditional foundation. It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet that adeptly merges temporal visual representations with text embeddings. Further enhancing this integration, we weave in a contrastive learning objective, designed to ensure that the visual-aligned text embeddings resonate closely with the audio features. Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.
- [1203] arXiv:2403.07944 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image InputsComments: 11 pages, 2 figures, 2 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, it remains a formidable challenge pertaining to maintaining temporal consistency and ensuring action smoothness throughout the generated sequences. In this paper, we present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build skilled world models framework based on textual prompts and accompanying images. The framework includes two parts: prompt enhancer and full video translation. The first part employs the capabilities of ChatGPT to meticulously distill and proactively construct precise prompts for each subsequent step, thereby guaranteeing the utmost accuracy in prompt communication and accurate execution in following model operations. The second part employ compatible with existing advanced diffusion techniques to expansively generate and refine the key frame at the conclusion of a video. Then we can expertly harness the power of leading and trailing key frames to craft videos with enhanced temporal consistency and action smoothness. The experimental results confirm that our method has strong effectiveness and novelty in constructing world models from text and image inputs over the other methods.
- [1204] arXiv:2403.07949 (cross-list from cs.GT) [ pdf , ps , other ]
-
Title: Algorithmic Bayesian EpistemologyComments: 385 pages, PhD thesis, 14 figures, 4 tablesSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: One aspect of the algorithmic lens in theoretical computer science is a view on other scientific disciplines that focuses on satisfactory solutions that adhere to real-world constraints, as opposed to solutions that would be optimal ignoring such constraints. The algorithmic lens has provided a unique and important perspective on many academic fields, including molecular biology, ecology, neuroscience, quantum physics, economics, and social science.
This thesis applies the algorithmic lens to Bayesian epistemology. Traditional Bayesian epistemology provides a comprehensive framework for how an individual's beliefs should evolve upon receiving new information. However, these methods typically assume an exhaustive model of such information, including the correlation structure between different pieces of evidence. In reality, individuals might lack such an exhaustive model, while still needing to form beliefs. Beyond such informational constraints, an individual may be bounded by limited computation, or by limited communication with agents that have access to information, or by the strategic behavior of such agents. Even when these restrictions prevent the formation of a *perfectly* accurate belief, arriving at a *reasonably* accurate belief remains crucial. In this thesis, we establish fundamental possibility and impossibility results about belief formation under a variety of restrictions, and lay the groundwork for further exploration. - [1205] arXiv:2403.07952 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AesopAgent: Agent-driven Evolutionary System on Story-to-Video ProductionJiuniu Wang , Zehua Du , Yuyuan Zhao , Bo Yuan , Kexiang Wang , Jian Liang , Yaxi Zhao , Yihen Lu , Gengliang Li , Junlong Gao , Xin Tu , Zhenyu GuoComments: 22 pages, 13 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: The Agent and AIGC (Artificial Intelligence Generated Content) technologies have recently made significant progress. We propose AesopAgent, an Agent-driven Evolutionary System on Story-to-Video Production. AesopAgent is a practical application of agent technology for multimodal content generation. The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily. This innovative system would convert user story proposals into scripts, images, and audio, and then integrate these multimodal contents into videos. Additionally, the animating units (e.g., Gen-2 and Sora) could make the videos more infectious. The AesopAgent system could orchestrate task workflow for video generation, ensuring that the generated video is both rich in content and coherent. This system mainly contains two layers, i.e., the Horizontal Layer and the Utility Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary system that optimizes the whole video generation workflow and the steps within the workflow. It continuously evolves and iteratively optimizes workflow by accumulating expert experience and professional knowledge, including optimizing the LLM prompts and utilities usage. The Utility Layer provides multiple utilities, leading to consistent image generation that is visually coherent in terms of composition, characters, and style. Meanwhile, it provides audio and special effects, integrating them into expressive and logically arranged videos. Overall, our AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling. Our AesopAgent is designed for convenient service for individual users, which is available on the following page: this https URL .
- [1206] arXiv:2403.07953 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Abstracting Sparse DNN Acceleration via Structured Sparse Tensor DecompositionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: Exploiting sparsity in deep neural networks (DNNs) has been a promising area to meet the growing computation need of modern DNNs. However, in practice, sparse DNN acceleration still faces a key challenge. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparse hardware support recently, which provides limited flexibility and requires extra model fine-tuning. Moreover, any sparse model fine-tuned for certain structured sparse hardware cannot be accelerated by other structured hardware. To bridge the gap between sparse DNN models and hardware, this paper proposes tensor approximation via structured decomposition (TASD), which leverages the distributive property in linear algebra to turn any sparse tensor into a series of structured sparse tensors. Next, we develop a software framework, TASDER, to accelerate DNNs by searching layer-wise, high-quality structured decomposition for both weight and activation tensors so that they can be accelerated by any systems with structured sparse hardware support. Evaluation results show that, by exploiting prior structured sparse hardware baselines, our method can accelerate off-the-shelf dense and sparse DNNs without fine-tuning and improves energy-delay-product by up to 83% and 74% on average.
- [1207] arXiv:2403.07955 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Faithful Explanations: Boosting Rationalization with Shortcuts DiscoveryComments: Accepted to ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The remarkable success in neural networks provokes the selective rationalization. It explains the prediction results by identifying a small subset of the inputs sufficient to support them. Since existing methods still suffer from adopting the shortcuts in data to compose rationales and limited large-scale annotated rationales by human, in this paper, we propose a Shortcuts-fused Selective Rationalization (SSR) method, which boosts the rationalization by discovering and exploiting potential shortcuts. Specifically, SSR first designs a shortcuts discovery approach to detect several potential shortcuts. Then, by introducing the identified shortcuts, we propose two strategies to mitigate the problem of utilizing shortcuts to compose rationales. Finally, we develop two data augmentations methods to close the gap in the number of annotated rationales. Extensive experimental results on real-world datasets clearly validate the effectiveness of our proposed method.
- [1208] arXiv:2403.07956 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DeepCDCL: An CDCL-based Neural Network Verification FrameworkSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Neural networks in safety-critical applications face increasing safety and security concerns due to their susceptibility to little disturbance. In this paper, we propose DeepCDCL, a novel neural network verification framework based on the Conflict-Driven Clause Learning (CDCL) algorithm. We introduce an asynchronous clause learning and management structure, reducing redundant time consumption compared to the direct application of the CDCL framework. Furthermore, we also provide a detailed evaluation of the performance of our approach on the ACAS Xu and MNIST datasets, showing that a significant speed-up is achieved in most cases.
- [1209] arXiv:2403.07957 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Post-Training Augmentation for Adaptive Inference in Heterogeneous and Distributed IoT EnvironmentsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Early Exit Neural Networks (EENNs) present a solution to enhance the efficiency of neural network deployments. However, creating EENNs is challenging and requires specialized domain knowledge, due to the large amount of additional design choices. To address this issue, we propose an automated augmentation flow that focuses on converting an existing model into an EENN. It performs all required design decisions for the deployment to heterogeneous or distributed hardware targets: Our framework constructs the EENN architecture, maps its subgraphs to the hardware targets, and configures its decision mechanism. To the best of our knowledge, it is the first framework that is able to perform all of these steps.
We evaluated our approach on a collection of Internet-of-Things and standard image classification use cases. For a speech command detection task, our solution was able to reduce the mean operations per inference by 59.67%. For an ECG classification task, it was able to terminate all samples early, reducing the mean inference energy by 74.9% and computations by 78.3%. On CIFAR-10, our solution was able to achieve up to a 58.75% reduction in computations.
The search on a ResNet-152 base model for CIFAR-10 took less than nine hours on a laptop CPU. Our proposed approach enables the creation of EENN optimized for IoT environments and can reduce the inference cost of Deep Learning applications on embedded and fog platforms, while also significantly reducing the search cost - making it more accessible for scientists and engineers in industry and research. The low search cost improves the accessibility of EENNs, with the potential to improve the efficiency of neural networks in a wide range of practical applications. - [1210] arXiv:2403.07958 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Temporal Decisions: Leveraging Temporal Correlation for Efficient Decisions in Early Exit Neural NetworksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Deep Learning is becoming increasingly relevant in Embedded and Internet-of-things applications. However, deploying models on embedded devices poses a challenge due to their resource limitations. This can impact the model's inference accuracy and latency. One potential solution are Early Exit Neural Networks, which adjust model depth dynamically through additional classifiers attached between their hidden layers. However, the real-time termination decision mechanism is critical for the system's efficiency, latency, and sustained accuracy.
This paper introduces Difference Detection and Temporal Patience as decision mechanisms for Early Exit Neural Networks. They leverage the temporal correlation present in sensor data streams to efficiently terminate the inference. We evaluate their effectiveness in health monitoring, image classification, and wake-word detection tasks. Our novel contributions were able to reduce the computational footprint compared to established decision mechanisms significantly while maintaining higher accuracy scores. We achieved a reduction of mean operations per inference by up to 80% while maintaining accuracy levels within 5% of the original model.
These findings highlight the importance of considering temporal correlation in sensor data to improve the termination decision. - [1211] arXiv:2403.07959 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: An Interpretable Generalization Mechanism for Accurately Detecting Anomaly and Identifying Networking Intrusion TechniquesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in Intrusion Detection Systems (IDS), integrating Explainable AI (XAI) methodologies, have led to notable improvements in system performance via precise feature selection. However, a thorough understanding of cyber-attacks requires inherently explainable decision-making processes within IDS. In this paper, we present the Interpretable Generalization Mechanism (IG), poised to revolutionize IDS capabilities. IG discerns coherent patterns, making it interpretable in distinguishing between normal and anomalous network traffic. Further, the synthesis of coherent patterns sheds light on intricate intrusion pathways, providing essential insights for cybersecurity forensics. By experiments with real-world datasets NSL-KDD, UNSW-NB15, and UKM-IDS20, IG is accurate even at a low ratio of training-to-test. With 10%-to-90%, IG achieves Precision (PRE)=0.93, Recall (REC)=0.94, and Area Under Curve (AUC)=0.94 in NSL-KDD; PRE=0.98, REC=0.99, and AUC=0.99 in UNSW-NB15; and PRE=0.98, REC=0.98, and AUC=0.99 in UKM-IDS20. Notably, in UNSW-NB15, IG achieves REC=1.0 and at least PRE=0.98 since 40%-to-60%; in UKM-IDS20, IG achieves REC=1.0 and at least PRE=0.88 since 20%-to-80%. Importantly, in UKM-IDS20, IG successfully identifies all three anomalous instances without prior exposure, demonstrating its generalization capabilities. These results and inferences are reproducible. In sum, IG showcases superior generalization by consistently performing well across diverse datasets and training-to-test ratios (from 10%-to-90% to 90%-to-10%), and excels in identifying novel anomalies without prior exposure. Its interpretability is enhanced by coherent evidence that accurately distinguishes both normal and anomalous activities, significantly improving detection accuracy and reducing false alarms, thereby strengthening IDS reliability and trustworthiness.
- [1212] arXiv:2403.07965 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Conditional computation in neural networks: principles and research trendsSimone Scardapane , Alessandro Baiocchi , Alessio Devoto , Valerio Marsocci , Pasquale Minervini , Jary PomponiComments: Under review at Intelligenza Artificiale (IOS Press)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This article summarizes principles and ideas from the emerging area of applying \textit{conditional computation} methods to the design of neural networks. In particular, we focus on neural networks that can dynamically activate or de-activate parts of their computational graph conditionally on their input. Examples include the dynamic selection of, e.g., input tokens, layers (or sets of layers), and sub-modules inside each layer (e.g., channels in a convolutional filter). We first provide a general formalism to describe these techniques in an uniform way. Then, we introduce three notable implementations of these principles: mixture-of-experts (MoEs) networks, token selection mechanisms, and early-exit neural networks. The paper aims to provide a tutorial-like introduction to this growing field. To this end, we analyze the benefits of these modular designs in terms of efficiency, explainability, and transfer learning, with a focus on emerging applicative areas ranging from automated scientific discovery to semantic communication.
- [1213] arXiv:2403.07968 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Do Deep Neural Network Solutions Form a Star Domain?Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Entezari et al. (2022) conjectured that neural network solution sets reachable via stochastic gradient descent (SGD) are convex, considering permutation invariances. This means that two independent solutions can be connected by a linear path with low loss, given one of them is appropriately permuted. However, current methods to test this theory often fail to eliminate loss barriers between two independent solutions (Ainsworth et al., 2022; Benzing et al., 2022). In this work, we conjecture that a more relaxed claim holds: the SGD solution set is a star domain that contains a star model that is linearly connected to all the other solutions via paths with low loss values, modulo permutations. We propose the Starlight algorithm that finds a star model of a given learning task. We validate our claim by showing that this star model is linearly connected with other independently found solutions. As an additional benefit of our study, we demonstrate better uncertainty estimates on Bayesian Model Averaging over the obtained star domain. Code is available at this https URL .
- [1214] arXiv:2403.07969 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: KnowCoder: Coding Structured Knowledge into LLMs for Universal Information ExtractionZixuan Li , Yutao Zeng , Yuxin Zuo , Weicheng Ren , Wenxuan Liu , Miao Su , Yucan Guo , Yantao Liu , Xiang Li , Zhilei Hu , Long Bai , Wei Li , Yidan Liu , Pan Yang , Xiaolong Jin , Jiafeng Guo , Xueqi ChengSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a kind of unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these, KnowCoder introduces a code-style schema representation method to uniformly transform different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over $\textbf{30,000}$ types of knowledge, which is the largest one for UIE, to the best of our knowledge. To ease the learning process of LLMs, KnowCoder contains a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around $1.5$B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves relative improvements by $\textbf{49.8%}$ F1, compared to LLaMA2, under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves up to $\textbf{12.5%}$ and $\textbf{21.9%}$, compared to sota baselines, under the zero-shot setting and the low resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder, which achieves significant improvements up to $\textbf{7.5%}$ under the supervised setting.
- [1215] arXiv:2403.07979 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Do Agents Dream of Electric Sheep?: Improving Generalization in Reinforcement Learning through Generative LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The Overfitted Brain hypothesis suggests dreams happen to allow generalization in the human brain. Here, we ask if the same is true for reinforcement learning agents as well. Given limited experience in a real environment, we use imagination-based reinforcement learning to train a policy on dream-like episodes, where non-imaginative, predicted trajectories are modified through generative augmentations. Experiments on four ProcGen environments show that, compared to classic imagination and offline training on collected experience, our method can reach a higher level of generalization when dealing with sparsely rewarded environments.
- [1216] arXiv:2403.08004 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Pix2Pix-OnTheFly: Leveraging LLMs for Instruction-Guided Image EditingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The combination of language processing and image processing keeps attracting increased interest given recent impressive advances that leverage the combined strengths of both domains of research. Among these advances, the task of editing an image on the basis solely of a natural language instruction stands out as a most challenging endeavour. While recent approaches for this task resort, in one way or other, to some form of preliminary preparation, training or fine-tuning, this paper explores a novel approach: We propose a preparation-free method that permits instruction-guided image editing on the fly. This approach is organized along three steps properly orchestrated that resort to image captioning and DDIM inversion, followed by obtaining the edit direction embedding, followed by image editing proper. While dispensing with preliminary preparation, our approach demonstrates to be effective and competitive, outperforming recent, state of the art models for this task when evaluated on the MAGICBRUSH dataset.
- [1217] arXiv:2403.08011 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken languageComments: Bachelor's thesis, 28 pages, includes appendixSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: An important and difficult task in code-switched speech recognition is to recognize the language, as lots of words in two languages can sound similar, especially in some accents. We focus on improving performance of end-to-end Automatic Speech Recognition models by conditioning transformer layers on language ID of words and character in the output in an per layer supervised manner. To this end, we propose two methods of introducing language specific parameters and explainability in the multi-head attention mechanism, and implement a Temporal Loss that helps maintain continuity in input alignment. Despite being unable to reduce WER significantly, our method shows promise in predicting the correct language from just spoken data. We introduce regularization in the language prediction by dropping LID in the sequence, which helps align long repeated output sequences.
- [1218] arXiv:2403.08017 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Red Teaming Models for Hyperspectral Image Analysis Using Explainable AIVladimir Zaigrajew , Hubert Baniecki , Lukasz Tulczyjew , Agata M. Wijata , Jakub Nalepa , Nicolas Longépé , Przemyslaw BiecekComments: 14 pages, 9 figures, ICLR 2024 Machine Learning for Remote Sensing (ML4RS) WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Remote sensing (RS) applications in the space domain demand machine learning (ML) models that are reliable, robust, and quality-assured, making red teaming a vital approach for identifying and exposing potential flaws and biases. Since both fields advance independently, there is a notable gap in integrating red teaming strategies into RS. This paper introduces a methodology for examining ML models operating on hyperspectral images within the HYPERVIEW challenge, focusing on soil parameters' estimation. We use post-hoc explanation methods from the Explainable AI (XAI) domain to critically assess the best performing model that won the HYPERVIEW challenge and served as an inspiration for the model deployed on board the INTUITION-1 hyperspectral mission. Our approach effectively red teams the model by pinpointing and validating key shortcomings, constructing a model that achieves comparable performance using just 1% of the input features and a mere up to 5% performance loss. Additionally, we propose a novel way of visualizing explanations that integrate domain-specific information about hyperspectral bands (wavelengths) and data transformations to better suit interpreting models for hyperspectral image analysis.
- [1219] arXiv:2403.08032 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LG-Traj: LLM Guided Pedestrian Trajectory PredictionComments: Under ReviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Accurate pedestrian trajectory prediction is crucial for various applications, and it requires a deep understanding of pedestrian motion patterns in dynamic environments. However, existing pedestrian trajectory prediction methods still need more exploration to fully leverage these motion patterns. This paper investigates the possibilities of using Large Language Models (LLMs) to improve pedestrian trajectory prediction tasks by inducing motion cues. We introduce LG-Traj, a novel approach incorporating LLMs to generate motion cues present in pedestrian past/observed trajectories. Our approach also incorporates motion cues present in pedestrian future trajectories by clustering future trajectories of training data using a mixture of Gaussians. These motion cues, along with pedestrian coordinates, facilitate a better understanding of the underlying representation. Furthermore, we utilize singular value decomposition to augment the observed trajectories, incorporating them into the model learning process to further enhance representation learning. Our method employs a transformer-based architecture comprising a motion encoder to model motion patterns and a social decoder to capture social interactions among pedestrians. We demonstrate the effectiveness of our approach on popular pedestrian trajectory prediction benchmarks, namely ETH-UCY and SDD, and present various ablation experiments to validate our approach.
- [1220] arXiv:2403.08035 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) excel in many diverse applications beyond language generation, e.g., translation, summarization, and sentiment analysis. One intriguing application is in text classification. This becomes pertinent in the realm of identifying hateful or toxic speech -- a domain fraught with challenges and ethical dilemmas. In our study, we have two objectives: firstly, to offer a literature review revolving around LLMs as classifiers, emphasizing their role in detecting and classifying hateful or toxic content. Subsequently, we explore the efficacy of several LLMs in classifying hate speech: identifying which LLMs excel in this task as well as their underlying attributes and training. Providing insight into the factors that contribute to an LLM proficiency (or lack thereof) in discerning hateful content. By combining a comprehensive literature review with an empirical analysis, our paper strives to shed light on the capabilities and constraints of LLMs in the crucial domain of hate speech detection.
- [1221] arXiv:2403.08036 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: A Review of Cybersecurity Incidents in the Food and Agriculture SectorComments: Preprint. Submitted for journal publicationSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The increasing utilization of emerging technologies in the Food & Agriculture (FA) sector has heightened the need for security to minimize cyber risks. Considering this aspect, this manuscript reviews disclosed and documented cybersecurity incidents in the FA sector. For this purpose, thirty cybersecurity incidents were identified, which took place between July 2011 and April 2023. The details of these incidents are reported from multiple sources such as: the private industry and flash notifications generated by the Federal Bureau of Investigation (FBI), internal reports from the affected organizations, and available media sources. Considering the available information, a brief description of the security threat, ransom amount, and impact on the organization are discussed for each incident. This review reports an increased frequency of cybersecurity threats to the FA sector. To minimize these cyber risks, popular cybersecurity frameworks and recent agriculture-specific cybersecurity solutions are also discussed. Further, the need for AI assurance in the FA sector is explained, and the Farmer-Centered AI (FCAI) framework is proposed. The main aim of the FCAI framework is to support farmers in decision-making for agricultural production, by incorporating AI assurance. Lastly, the effects of the reported cyber incidents on other critical infrastructures, food security, and the economy are noted, along with specifying the open issues for future development.
- [1222] arXiv:2403.08049 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical TasksComments: CHI 2024, supplementary materials: this https URLSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives than timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to effectively harness their powers, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. We show that TutoAI has achieved higher or similar quality compared to a baseline model in preliminary user studies.
- [1223] arXiv:2403.08059 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FluoroSAM: A Language-aligned Foundation Model for X-ray Image SegmentationBenjamin D. Killeen , Liam J. Wang , Han Zhang , Mehran Armand , Russell H. Taylor , Dave Dreizin , Greg Osgood , Mathias UnberathSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Automated X-ray image segmentation would accelerate research and development in diagnostic and interventional precision medicine. Prior efforts have contributed task-specific models capable of solving specific image analysis problems, but the utility of these models is restricted to their particular task domain, and expanding to broader use requires additional data, labels, and retraining efforts. Recently, foundation models (FMs) -- machine learning models trained on large amounts of highly variable data thus enabling broad applicability -- have emerged as promising tools for automated image analysis. Existing FMs for medical image analysis focus on scenarios and modalities where objects are clearly defined by visually apparent boundaries, such as surgical tool segmentation in endoscopy. X-ray imaging, by contrast, does not generally offer such clearly delineated boundaries or structure priors. During X-ray image formation, complex 3D structures are projected in transmission onto the imaging plane, resulting in overlapping features of varying opacity and shape. To pave the way toward an FM for comprehensive and automated analysis of arbitrary medical X-ray images, we develop FluoroSAM, a language-aligned variant of the Segment-Anything Model, trained from scratch on 1.6M synthetic X-ray images. FluoroSAM is trained on data including masks for 128 organ types and 464 non-anatomical objects, such as tools and implants. In real X-ray images of cadaveric specimens, FluoroSAM is able to segment bony anatomical structures based on text-only prompting with 0.51 and 0.79 DICE with point-based refinement, outperforming competing SAM variants for all structures. FluoroSAM is also capable of zero-shot generalization to segmenting classes beyond the training set thanks to its language alignment, which we demonstrate for full lung segmentation on real chest X-rays.
- [1224] arXiv:2403.08077 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Multimodal Intermediate Fusion Network with Manifold Learning for Stress DetectionComments: This work was accepted to The 3rd International Conference on Computing and Machine Intelligence (ICMI 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multimodal deep learning methods capture synergistic features from multiple modalities and have the potential to improve accuracy for stress detection compared to unimodal methods. However, this accuracy gain typically comes from high computational cost due to the high-dimensional feature spaces, especially for intermediate fusion. Dimensionality reduction is one way to optimize multimodal learning by simplifying data and making the features more amenable to processing and analysis, thereby reducing computational complexity. This paper introduces an intermediate multimodal fusion network with manifold learning-based dimensionality reduction. The multimodal network generates independent representations from biometric signals and facial landmarks through 1D-CNN and 2D-CNN. Finally, these features are fused and fed to another 1D-CNN layer, followed by a fully connected dense layer. We compared various dimensionality reduction techniques for different variations of unimodal and multimodal networks. We observe that the intermediate-level fusion with the Multi-Dimensional Scaling (MDS) manifold method showed promising results with an accuracy of 96.00\% in a Leave-One-Subject-Out Cross-Validation (LOSO-CV) paradigm over other dimensional reduction methods. MDS had the highest computational cost among manifold learning methods. However, while outperforming other networks, it managed to reduce the computational cost of the proposed networks by 25\% when compared to six well-known conventional feature selection methods used in the preprocessing step.
- [1225] arXiv:2403.08081 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Mechanics of Next Token Prediction with Self-AttentionComments: Accepted to AISTATS 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: $\textit{What}$ $\textit{does}$ $\textit{a}$ $\textit{single}$ $\textit{self-attention}$ $\textit{layer}$ $\textit{learn}$ $\textit{from}$ $\textit{next-token}$ $\textit{prediction?}$ We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1)}$ $\textbf{Hard}$ $\textbf{retrieval:}$ Given input sequence, self-attention precisely selects the $\textit{high-priority}$ $\textit{input}$ $\textit{tokens}$ associated with the last input token. $\textbf{(2)}$ $\textbf{Soft}$ $\textbf{composition:}$ It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.
- [1226] arXiv:2403.08103 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Contextual Clarity: Generating Sentences with Transformer Models using Context-Reverso DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In the age of information abundance, the ability to provide users with contextually relevant and concise information is crucial. Keyword in Context (KIC) generation is a task that plays a vital role in and generation applications, such as search engines, personal assistants, and content summarization. In this paper, we present a novel approach to generating unambiguous and brief sentence-contexts for given keywords using the T5 transformer model, leveraging data obtained from the Context-Reverso API. The code is available at this https URL .
- [1227] arXiv:2403.08111 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: AI-Assisted Causal Pathway Diagram for Human-Centered DesignJournal-ref: In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11-16, 2024, Honolulu, HI, USASubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: This paper explores the integration of causal pathway diagrams (CPD) into human-centered design (HCD), investigating how these diagrams can enhance the early stages of the design process. A dedicated CPD plugin for the online collaborative whiteboard platform Miro was developed to streamline diagram creation and offer real-time AI-driven guidance. Through a user study with designers (N=20), we found that CPD's branching and its emphasis on causal connections supported both divergent and convergent processes during design. CPD can also facilitate communication among stakeholders. Additionally, we found our plugin significantly reduces designers' cognitive workload and increases their creativity during brainstorming, highlighting the implications of AI-assisted tools in supporting creative work and evidence-based designs.
- [1228] arXiv:2403.08115 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Legally Binding but Unfair? Towards Assessing Fairness of Privacy PoliciesComments: Accepted at IWSPA 2024Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Privacy policies are expected to inform data subjects about their data protection rights and should explain the data controller's data management practices. Privacy policies only fulfill their purpose, if they are correctly interpreted, understood, and trusted by the data subject. This implies that a privacy policy is written in a fair way, e.g., it does not use polarizing terms, does not require a certain education, or does not assume a particular social background. We outline our approach to assessing fairness in privacy policies. We identify from fundamental legal sources and fairness research, how the dimensions informational fairness, representational fairness and ethics / morality are related to privacy policies. We propose options to automatically assess policies in these fairness dimensions, based on text statistics, linguistic methods and artificial intelligence. We conduct initial experiments with German privacy policies to provide evidence that our approach is applicable. Our experiments indicate that there are issues in all three dimensions of fairness. This is important, as future privacy policies may be used in a corpus for legal artificial intelligence models.
- [1229] arXiv:2403.08118 (cross-list from stat.ME) [ pdf , ps , html , other ]
-
Title: Characterising harmful data sources when constructing multi-fidelity surrogate modelsSubjects: Methodology (stat.ME) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: Surrogate modelling techniques have seen growing attention in recent years when applied to both modelling and optimisation of industrial design problems. These techniques are highly relevant when assessing the performance of a particular design carries a high cost, as the overall cost can be mitigated via the construction of a model to be queried in lieu of the available high-cost source. The construction of these models can sometimes employ other sources of information which are both cheaper and less accurate. The existence of these sources however poses the question of which sources should be used when constructing a model. Recent studies have attempted to characterise harmful data sources to guide practitioners in choosing when to ignore a certain source. These studies have done so in a synthetic setting, characterising sources using a large amount of data that is not available in practice. Some of these studies have also been shown to potentially suffer from bias in the benchmarks used in the analysis. In this study, we present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model. We employ recently developed benchmark filtering techniques to conduct a bias-free assessment, providing objectively varied benchmark suites of different sizes for future research. Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used and use this analysis to provide guidelines that can be used in an applied industrial setting.
- [1230] arXiv:2403.08124 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Independence Criterion in Machine Unlearning of Features and LabelsComments: 10 pages, 1 figureSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: This work delves into the complexities of machine unlearning in the face of distributional shifts, particularly focusing on the challenges posed by non-uniform feature and label removal. With the advent of regulations like the GDPR emphasizing data privacy and the right to be forgotten, machine learning models face the daunting task of unlearning sensitive information without compromising their integrity or performance. Our research introduces a novel approach that leverages influence functions and principles of distributional independence to address these challenges. By proposing a comprehensive framework for machine unlearning, we aim to ensure privacy protection while maintaining model performance and adaptability across varying distributions. Our method not only facilitates efficient data removal but also dynamically adjusts the model to preserve its generalization capabilities. Through extensive experimentation, we demonstrate the efficacy of our approach in scenarios characterized by significant distributional shifts, making substantial contributions to the field of machine unlearning. This research paves the way for developing more resilient and adaptable unlearning techniques, ensuring models remain robust and accurate in the dynamic landscape of data privacy and machine learning.
- [1231] arXiv:2403.08133 (cross-list from eess.SP) [ pdf , ps , html , other ]
-
Title: Physics-Inspired Deep Learning Anti-Aliasing Framework in Efficient Channel State FeedbackSubjects: Signal Processing (eess.SP) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Abstract: Acquiring downlink channel state information (CSI) at the base station is vital for optimizing performance in massive Multiple input multiple output (MIMO) Frequency-Division Duplexing (FDD) systems. While deep learning architectures have been successful in facilitating UE-side CSI feedback and gNB-side recovery, the undersampling issue prior to CSI feedback is often overlooked. This issue, which arises from low density pilot placement in current standards, results in significant aliasing effects in outdoor channels and consequently limits CSI recovery performance. To this end, this work introduces a new CSI upsampling framework at the gNB as a post-processing solution to address the gaps caused by undersampling. Leveraging the physical principles of discrete Fourier transform shifting theorem and multipath reciprocity, our framework effectively uses uplink CSI to mitigate aliasing effects. We further develop a learning-based method that integrates the proposed algorithm with the Iterative Shrinkage-Thresholding Algorithm Net (ISTA-Net) architecture, enhancing our approach for non-uniform sampling recovery. Our numerical results show that both our rule-based and deep learning methods significantly outperform traditional interpolation techniques and current state-of-the-art approaches in terms of performance.
- [1232] arXiv:2403.08136 (cross-list from cs.LO) [ pdf , ps , other ]
-
Title: RoboCertProb: Property Specification for Probabilistic RoboChart ModelsComments: 24 pages, 10 figures, 4 tables, submitted to the International Journal on Software and Systems Modeling (SoSyM)Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI)
Abstract: RoboChart is a core notation in the RoboStar framework which brings modern modelling and formal verification technologies into software engineering for robotics. It is a timed and probabilistic domain-specific language for robotics and provides a UML-like architectural and state machine modelling. This work presents RoboCertProb for specifying quantitative properties of probabilistic robotic systems modelled in RoboChart. RoboCertProb's semantics is based on PCTL*. To interpret RoboCertProb over RoboChart models, we give a Markov semantics (DTMCs and MDPs) to RoboChart, derived from its existing transformation semantics to the PRISM language. In addition to property specification, RoboCertProb also entitles us to configure loose constants and unspecified functions and operations in RoboChart models. It allows us to set up environmental inputs to verify reactive probabilistic systems not directly supported in probabilistic model checkers like PRISM because they employ a closed-world assumption. We implement RoboCertProb in an accompanying tool of RoboChart, RoboTool, for specifying properties and automatically generating PRISM properties from them to formally verify RoboChart models using PRISM. We have used it to analyse the behaviour of software controllers for two real robots: an industrial painting robot and an agricultural robot for treating plants with UV lights.
- [1233] arXiv:2403.08137 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: From Paper to Card: Transforming Design Implications with Generative AIJournal-ref: In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11-16, 2024, Honolulu, HI, USA. ACM, New York, NY, USASubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Communicating design implications is common within the HCI community when publishing academic papers, yet these papers are rarely read and used by designers. One solution is to use design cards as a form of translational resource that communicates valuable insights from papers in a more digestible and accessible format to assist in design processes. However, creating design cards can be time-consuming, and authors may lack the resources/know-how to produce cards. Through an iterative design process, we built a system that helps create design cards from academic papers using an LLM and text-to-image model. Our evaluation with designers (N=21) and authors of selected papers (N=12) revealed that designers perceived the design implications from our design cards as more inspiring and generative, compared to reading original paper texts, and the authors viewed our system as an effective way of communicating their design implications. We also propose future enhancements for AI-generated design cards.
- [1234] arXiv:2403.08151 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Measuring the Energy Consumption and Efficiency of Deep Neural Networks: An Empirical Analysis and Design RecommendationsCharles Edison Tripp , Jordan Perr-Sauer , Jamil Gafur , Amabarish Nag , Avi Purkayastha , Sagi Zisman , Erik A. BensenComments: 25 pages, 8 figures, for associated dataset see this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Addressing the so-called ``Red-AI'' trend of rising energy consumption by large-scale neural networks, this study investigates the actual energy consumption, as measured by node-level watt-meters, of training various fully connected neural network architectures. We introduce the BUTTER-E dataset, an augmentation to the BUTTER Empirical Deep Learning dataset, containing energy consumption and performance data from 63,527 individual experimental runs spanning 30,582 distinct configurations: 13 datasets, 20 sizes (number of trainable parameters), 8 network ``shapes'', and 14 depths on both CPU and GPU hardware collected using node-level watt-meters. This dataset reveals the complex relationship between dataset size, network structure, and energy use, and highlights the impact of cache effects. We propose a straightforward and effective energy model that accounts for network size, computing, and memory hierarchy. Our analysis also uncovers a surprising, hardware-mediated non-linear relationship between energy efficiency and network design, challenging the assumption that reducing the number of parameters or FLOPs is the best way to achieve greater energy efficiency. Highlighting the need for cache-considerate algorithm development, we suggest a combined approach to energy efficient network, algorithm, and hardware design. This work contributes to the fields of sustainable computing and Green AI, offering practical guidance for creating more energy-efficient neural networks and promoting sustainable AI.
- [1235] arXiv:2403.08153 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: The Runtime of Random Local Search on the Generalized Needle ProblemComments: 18 pagesSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Abstract: In their recent work, C. Doerr and Krejca (Transactions on Evolutionary Computation, 2023) proved upper bounds on the expected runtime of the randomized local search heuristic on generalized Needle functions. Based on these upper bounds, they deduce in a not fully rigorous manner a drastic influence of the needle radius $k$ on the runtime.
In this short article, we add the missing lower bound necessary to determine the influence of parameter $k$ on the runtime. To this aim, we derive an exact description of the expected runtime, which also significantly improves the upper bound given by C. Doerr and Krejca. We also describe asymptotic estimates of the expected runtime. - [1236] arXiv:2403.08161 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LAFS: Landmark-based Facial Self-supervised Learning for Face RecognitionComments: accepted to CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore the learning strategy of these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by one recent finding, that is, the face saliency area is critical for face recognition, in contrast to utilizing random cropped blocks of images for constructing augmentations in pretraining, we utilize patches localized by extracted facial landmarks. This enables our method - namely LAndmark-based Facial Self-supervised learning LAFS), to learn key representation that is more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks, especially on more challenging few-shot scenarios.
- [1237] arXiv:2403.08174 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rethinking Loss Functions for Fact VerificationComments: Accepted to EACL 2024 (short paper). The souce code is available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We explore loss functions for fact verification in the FEVER shared task. While the cross-entropy loss is a standard objective for training verdict predictors, it fails to capture the heterogeneity among the FEVER verdict classes. In this paper, we develop two task-specific objectives tailored to FEVER. Experimental results confirm that the proposed objective functions outperform the standard cross-entropy. Performance is further improved when these objectives are combined with simple class weighting, which effectively overcomes the imbalance in the training data. The souce code is available at this https URL
- [1238] arXiv:2403.08197 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: PAGE: Domain-Incremental Adaptation with Past-Agnostic Generative Replay for Smart HealthcareComments: 30 pages, 7 figures. arXiv admin note: text overlap with arXiv:2305.05738Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We propose PAGE, a domain-incremental adaptation strategy with past-agnostic generative replay for smart healthcare. PAGE enables generative replay without the aid of any preserved data or information from prior domains. When adapting to a new domain, it exploits real data from the new distribution and the current model to generate synthetic data that retain the learned knowledge of previous domains. By replaying the synthetic data with the new real data during training, PAGE achieves a good balance between domain adaptation and knowledge retention. In addition, we incorporate an extended inductive conformal prediction (EICP) method into PAGE to produce a confidence score and a credibility value for each detection result. This makes the predictions interpretable and provides statistical guarantees for disease detection in smart healthcare applications. We demonstrate PAGE's effectiveness in domain-incremental disease detection with three distinct disease datasets collected from commercially available WMSs. PAGE achieves highly competitive performance against state-of-the-art with superior scalability, data privacy, and feasibility. Furthermore, PAGE can enable up to 75% reduction in clinical workload with the help of EICP.
- [1239] arXiv:2403.08199 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Submodular Peripteral NetworksComments: PreprintSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Submodular functions, crucial for various applications, often lack practical learning methods for their acquisition. Seemingly unrelated, learning a scaling from oracles offering graded pairwise preferences (GPC) is underexplored, despite a rich history in psychometrics. In this paper, we introduce deep submodular peripteral networks (DSPNs), a novel parametric family of submodular functions, and methods for their training using a contrastive-learning inspired GPC-ready strategy to connect and then tackle both of the above challenges. We introduce newly devised GPC-style "peripteral" loss which leverages numerically graded relationships between pairs of objects (sets in our case). Unlike traditional contrastive learning, our method utilizes graded comparisons, extracting more nuanced information than just binary-outcome comparisons, and contrasts sets of any size (not just two). We also define a novel suite of automatic sampling strategies for training, including active-learning inspired submodular feedback. We demonstrate DSPNs' efficacy in learning submodularity from a costly target submodular function showing superiority in downstream tasks such as experimental design and streaming applications.
- [1240] arXiv:2403.08211 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large Language Models are Contrastive ReasonersSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Prompting methods play a crucial role in enhancing the capabilities of pre-trained large language models (LLMs). We explore how contrastive prompting (CP) significantly improves the ability of large language models to perform complex reasoning. We demonstrate that LLMs are decent contrastive reasoners by simply adding "Let's give a correct and a wrong answer." before LLMs provide answers. Experiments on two large language models show that zero-shot contrastive prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks without any hand-crafted few-shot examples, such as increasing the accuracy on GSM8K from 35.9% to 88.8% and AQUA-RAT from 41.3% to 62.2% with the state-of-the-art GPT-4 model. Our method not only surpasses zero-shot CoT and few-shot CoT in most arithmetic and commonsense reasoning tasks but also can seamlessly integrate with existing prompting methods, resulting in improved or comparable results when compared to state-of-the-art methods. Our code is available at this https URL
- [1241] arXiv:2403.08214 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: P2LHAP:Wearable sensor-based human activity recognition, segmentation and forecast through Patch-to-Label Seq2Seq TransformerSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Traditional deep learning methods struggle to simultaneously segment, recognize, and forecast human activities from sensor data. This limits their usefulness in many fields such as healthcare and assisted living, where real-time understanding of ongoing and upcoming activities is crucial. This paper introduces P2LHAP, a novel Patch-to-Label Seq2Seq framework that tackles all three tasks in a efficient single-task model. P2LHAP divides sensor data streams into a sequence of "patches", served as input tokens, and outputs a sequence of patch-level activity labels including the predicted future activities. A unique smoothing technique based on surrounding patch labels, is proposed to identify activity boundaries accurately. Additionally, P2LHAP learns patch-level representation by sensor signal channel-independent Transformer encoders and decoders. All channels share embedding and Transformer weights across all sequences. Evaluated on three public datasets, P2LHAP significantly outperforms the state-of-the-art in all three tasks, demonstrating its effectiveness and potential for real-world applications.
- [1242] arXiv:2403.08215 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LIX: Implicitly Infusing Spatial Geometric Prior Knowledge into Visual Semantic Segmentation for Autonomous DrivingComments: 13 pages, 4 figures, 5 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Despite the impressive performance achieved by data-fusion networks with duplex encoders for visual semantic segmentation, they become ineffective when spatial geometric data are not available. Implicitly infusing the spatial geometric prior knowledge acquired by a duplex-encoder teacher model into a single-encoder student model is a practical, albeit less explored research avenue. This paper delves into this topic and resorts to knowledge distillation approaches to address this problem. We introduce the Learning to Infuse "X" (LIX) framework, with novel contributions in both logit distillation and feature distillation aspects. We present a mathematical proof that underscores the limitation of using a single fixed weight in decoupled knowledge distillation and introduce a logit-wise dynamic weight controller as a solution to this issue. Furthermore, we develop an adaptively-recalibrated feature distillation algorithm, including two technical novelties: feature recalibration via kernel regression and in-depth feature consistency quantification via centered kernel alignment. Extensive experiments conducted with intermediate-fusion and late-fusion networks across various public datasets provide both quantitative and qualitative evaluations, demonstrating the superior performance of our LIX framework when compared to other state-of-the-art approaches.
- [1243] arXiv:2403.08222 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Robust Decision Aggregation with Adversarial ExpertsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We consider a binary decision aggregation problem in the presence of both truthful and adversarial experts. The truthful experts will report their private signals truthfully with proper incentive, while the adversarial experts can report arbitrarily. The decision maker needs to design a robust aggregator to forecast the true state of the world based on the reports of experts. The decision maker does not know the specific information structure, which is a joint distribution of signals, states, and strategies of adversarial experts. We want to find the optimal aggregator minimizing regret under the worst information structure. The regret is defined by the difference in expected loss between the aggregator and a benchmark who makes the optimal decision given the joint distribution and reports of truthful experts.
We prove that when the truthful experts are symmetric and adversarial experts are not too numerous, the truncated mean is optimal, which means that we remove some lowest reports and highest reports and take averaging among the left reports. Moreover, for many settings, the optimal aggregators are in the family of piecewise linear functions. The regret is independent of the total number of experts but only depends on the ratio of adversaries. We evaluate our aggregators by numerical experiment in an ensemble learning task. We also obtain some negative results for the aggregation problem with adversarial experts under some more general information structures and experts' report space. - [1244] arXiv:2403.08238 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Novel Feature Learning-based Bio-inspired Neural Network for Real-time Collision-free Rescue of Multi-Robot SystemsComments: This paper is accepted to publish in IEEE Transactions on Industrial ElectronicsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Natural disasters and urban accidents drive the demand for rescue robots to provide safer, faster, and more efficient rescue trajectories. In this paper, a feature learning-based bio-inspired neural network (FLBBINN) is proposed to quickly generate a heuristic rescue path in complex and dynamic environments, as traditional approaches usually cannot provide a satisfactory solution to real-time responses to sudden environmental changes. The neurodynamic model is incorporated into the feature learning method that can use environmental information to improve path planning strategies. Task assignment and collision-free rescue trajectory are generated through robot poses and the dynamic landscape of neural activity. A dual-channel scale filter, a neural activity channel, and a secondary distance fusion are employed to extract and filter feature neurons. After completion of the feature learning process, a neurodynamics-based feature matrix is established to quickly generate the new heuristic rescue paths with parameter-driven topological adaptability. The proposed FLBBINN aims to reduce the computational complexity of the neural network-based approach and enable the feature learning method to achieve real-time responses to environmental changes. Several simulations and experiments have been conducted to evaluate the performance of the proposed FLBBINN. The results show that the proposed FLBBINN would significantly improve the speed, efficiency, and optimality for rescue operations.
- [1245] arXiv:2403.08251 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Emergence of Social Norms in Large Language Model-based Agent SocietiesSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: The emergence of social norms has attracted much interest in a wide array of disciplines, ranging from social science and cognitive science to artificial intelligence. In this paper, we propose the first generative agent architecture that empowers the emergence of social norms within a population of large language model-based agents. Our architecture, named CRSEC, consists of four modules: Creation & Representation, Spreading, Evaluation, and Compliance. Our architecture addresses several important aspects of the emergent processes all in one: (i) where social norms come from, (ii) how they are formally represented, (iii) how they spread through agents' communications and observations, (iv) how they are examined with a sanity check and synthesized in the long term, and (v) how they are incorporated into agents' planning and actions. Our experiments deployed in the Smallville sandbox game environment demonstrate the capability of our architecture to establish social norms and reduce social conflicts within large language model-based multi-agent systems. The positive outcomes of our human evaluation, conducted with 30 evaluators, further affirm the effectiveness of our approach.
- [1246] arXiv:2403.08261 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CoroNetGAN: Controlled Pruning of GANs via HypernetworksAman Kumar , Khushboo Anand , Shubham Mandloi , Ashutosh Mishra , Avinash Thakur , Neeraj Kasera , Prathosh A PSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: Generative Adversarial Networks (GANs) have proven to exhibit remarkable performance and are widely used across many generative computer vision applications. However, the unprecedented demand for the deployment of GANs on resource-constrained edge devices still poses a challenge due to huge number of parameters involved in the generation process. This has led to focused attention on the area of compressing GANs. Most of the existing works use knowledge distillation with the overhead of teacher dependency. Moreover, there is no ability to control the degree of compression in these methods. Hence, we propose CoroNet-GAN for compressing GAN using the combined strength of differentiable pruning method via hypernetworks. The proposed method provides the advantage of performing controllable compression while training along with reducing training time by a substantial factor. Experiments have been done on various conditional GAN architectures (Pix2Pix and CycleGAN) to signify the effectiveness of our approach on multiple benchmark datasets such as Edges-to-Shoes, Horse-to-Zebra and Summer-to-Winter. The results obtained illustrate that our approach succeeds to outperform the baselines on Zebra-to-Horse and Summer-to-Winter achieving the best FID score of 32.3 and 72.3 respectively, yielding high-fidelity images across all the datasets. Additionally, our approach also outperforms the state-of-the-art methods in achieving better inference time on various smart-phone chipsets and data-types making it a feasible solution for deployment on edge devices.
- [1247] arXiv:2403.08264 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: GPT, Ontology, and CAABAC: A Tripartite Personalized Access Control Model Anchored by Compliance, Context and AttributeSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: As digital healthcare evolves, the security of electronic health records (EHR) becomes increasingly crucial. This study presents the GPT-Onto-CAABAC framework, integrating Generative Pretrained Transformer (GPT), medical-legal ontologies and Context-Aware Attribute-Based Access Control (CAABAC) to enhance EHR access security. Unlike traditional models, GPT-Onto-CAABAC dynamically interprets policies and adapts to changing healthcare and legal environments, offering customized access control solutions. Through empirical evaluation, this framework is shown to be effective in improving EHR security by accurately aligning access decisions with complex regulatory and situational requirements. The findings suggest its broader applicability in sectors where access control must meet stringent compliance and adaptability standards.
- [1248] arXiv:2403.08265 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Random Search as a Baseline for Sparse Neural Network Architecture SearchSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Sparse neural networks have shown similar or better generalization performance than their dense counterparts while having higher parameter efficiency. This has motivated a number of works to learn or search for high performing sparse networks. While reports of task performance or efficiency gains are impressive, standard baselines are lacking leading to poor comparability and unreliable reproducibility across methods. In this work, we propose Random Search as a baseline algorithm for finding good sparse configurations and study its performance. We apply Random Search on the node space of an overparameterized network with the goal of finding better initialized sparse sub-networks that are positioned more advantageously in the loss landscape. We record the post-training performances of the found sparse networks and at various levels of sparsity, and compare against both their fully connected parent networks and random sparse configurations at the same sparsity levels. First, we demonstrate performance at different levels of sparsity and highlight that a significant level of performance can still be preserved even when the network is highly sparse. Second, we observe that for this sparse architecture search task, initialized sparse networks found by Random Search neither perform better nor converge more efficiently than their random counterparts. Thus we conclude that Random Search may be viewed as a reasonable neutral baseline for sparsity search methods.
- [1249] arXiv:2403.08271 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Fine-grained ship classification in remote sensing (RS-FGSC) poses a significant challenge due to the high similarity between classes and the limited availability of labeled data, limiting the effectiveness of traditional supervised classification methods. Recent advancements in large pre-trained Vision-Language Models (VLMs) have demonstrated impressive capabilities in few-shot or zero-shot learning, particularly in understanding image content. This study delves into harnessing the potential of VLMs to enhance classification accuracy for unseen ship categories, which holds considerable significance in scenarios with restricted data due to cost or privacy constraints. Directly fine-tuning VLMs for RS-FGSC often encounters the challenge of overfitting the seen classes, resulting in suboptimal generalization to unseen classes, which highlights the difficulty in differentiating complex backgrounds and capturing distinct ship features. To address these issues, we introduce a novel prompt tuning technique that employs a hierarchical, multi-granularity prompt design. Our approach integrates remote sensing ship priors through bias terms, learned from a small trainable network. This strategy enhances the model's generalization capabilities while improving its ability to discern intricate backgrounds and learn discriminative ship features. Furthermore, we contribute to the field by introducing a comprehensive dataset, FGSCM-52, significantly expanding existing datasets with more extensive data and detailed annotations for less common ship classes. Extensive experimental evaluations demonstrate the superiority of our proposed method over current state-of-the-art techniques. The source code will be made publicly available.
- [1250] arXiv:2403.08273 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LiqD: A Dynamic Liquid Level Detection Model under Tricky Small ContainersComments: 7pages, 7 FiguresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In daily life and industrial production, it is crucial to accurately detect changes in liquid level in containers. Traditional contact measurement methods have some limitations, while emerging non-contact image processing technology shows good application prospects. This paper proposes a container dynamic liquid level detection model based on U^2-Net. This model uses the SAM model to generate an initial data set, and then evaluates and filters out high-quality pseudo-label images through the SemiReward framework to build an exclusive data set. The model uses U^2-Net to extract mask images of containers from the data set, and uses morphological processing to compensate for mask defects. Subsequently, the model calculates the grayscale difference between adjacent video frame images at the same position, segments the liquid level change area by setting a difference threshold, and finally uses a lightweight neural network to classify the liquid level state. This approach not only mitigates the impact of intricate surroundings, but also reduces the demand for training data, showing strong robustness and versatility. A large number of experimental results show that the proposed model can effectively detect the dynamic liquid level changes of the liquid in the container, providing a novel and efficient solution for related fields.
- [1251] arXiv:2403.08281 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language ModelsNing Ding , Yulin Chen , Ganqu Cui , Xingtai Lv , Weilin Zhao , Ruobing Xie , Bowen Zhou , Zhiyuan Liu , Maosong SunSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Underlying data distributions of natural language, programming code, and mathematical symbols vary vastly, presenting a complex challenge for large language models (LLMs) that strive to achieve high performance across all three domains simultaneously. Achieving a very high level of proficiency for an LLM within a specific domain often requires extensive training with relevant corpora, which is typically accompanied by a sacrifice in performance in other domains. In this paper, we propose to fuse models that are already highly-specialized directly. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics. A token-level gating mechanism is introduced to blend the specialists' outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model could simultaneously achieve mastery of the three crucial domains.
- [1252] arXiv:2403.08291 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: CleanAgent: Automating Data Standardization with LLM-based AgentsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Data standardization is a crucial part in data science life cycle. While tools like Pandas offer robust functionalities, their complexity and the manual effort required for customizing code to diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, it still demands expert-level programming knowledge and continuous interaction for prompt refinement. To solve these challenges, our key idea is to propose a Python library with declarative, unified APIs for standardizing column types, simplifying the code generation of LLM with concise API calls. We first propose Dataprep.Clean which is written as a component of the Dataprep Library, offers a significant reduction in complexity by enabling the standardization of specific column types with a single line of code. Then we introduce the CleanAgent framework integrating Dataprep.Clean and LLM-based agents to automate the data standardization process. With CleanAgent, data scientists need only provide their requirements once, allowing for a hands-free, automatic standardization process.
- [1253] arXiv:2403.08292 (cross-list from math.NA) [ pdf , ps , html , other ]
-
Title: Weak Collocation Regression for Inferring Stochastic Dynamics with L\'{e}vy NoiseComments: 19 pages, 5 figures, 10 tablesSubjects: Numerical Analysis (math.NA) ; Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Abstract: With the rapid increase of observational, experimental and simulated data for stochastic systems, tremendous efforts have been devoted to identifying governing laws underlying the evolution of these systems. Despite the broad applications of non-Gaussian fluctuations in numerous physical phenomena, the data-driven approaches to extracting stochastic dynamics with Lévy noise are relatively few. In this work, we propose a Weak Collocation Regression (WCR) to explicitly reveal unknown stochastic dynamical systems, i.e., the Stochastic Differential Equation (SDE) with both $\alpha$-stable Lévy noise and Gaussian noise, from discrete aggregate data. This method utilizes the evolution equation of the probability distribution function, i.e., the Fokker-Planck (FP) equation. With the weak form of the FP equation, the WCR constructs a linear system of unknown parameters where all integrals are evaluated by Monte Carlo method with the observations. Then, the unknown parameters are obtained by a sparse linear regression. For a SDE with Lévy noise, the corresponding FP equation is a partial integro-differential equation (PIDE), which contains nonlocal terms, and is difficult to deal with. The weak form can avoid complicated multiple integrals. Our approach can simultaneously distinguish mixed noise types, even in multi-dimensional problems. Numerical experiments demonstrate that our method is accurate and computationally efficient.
- [1254] arXiv:2403.08293 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at ScaleComments: preprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training. It consists of two components, a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with $9$ billion tokens, and demonstrate the superiority of GPST over GPT-2 with a comparable size in numerous tasks covering both language understanding and language generation. Meanwhile, GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while holding a substantial acceleration on training.
- [1255] arXiv:2403.08295 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Gemma: Open Models Based on Gemini Research and TechnologyGemma Team : Thomas Mesnard , Cassidy Hardin , Robert Dadashi , Surya Bhupatiraju , Shreya Pathak , Laurent Sifre , Morgane Rivière , Mihir Sanjay Kale , Juliette Love , Pouya Tafti , Léonard Hussenot , Pier Giuseppe Sessa , Aakanksha Chowdhery , Adam Roberts , Aditya Barua , Alex Botev , Alex Castro-Ros , Ambrose Slone , Amélie Héliou , Andrea Tacchetti , Anna Bulanova , Antonia Paterson , Beth Tsai , Bobak Shahriari , Charline Le Lan , Christopher A. Choquette-Choo , Clément Crepy , Daniel Cer , Daphne Ippolito , David Reid , Elena Buchatskaya , Eric Ni , Eric Noland , Geng Yan , George Tucker , George-Christian Muraru , Grigory Rozhdestvenskiy , Henryk Michalewski , Ian Tenney , Ivan Grishchenko , Jacob Austin , James Keeling , Jane Labanowski , Jean-Baptiste Lespiau , Jeff Stanway , Jenny Brennan , Jeremy Chen , Johan Ferret , Justin Chiu , Justin Mao-Jones , Katherine Lee , Kathy Yu , Katie Millican , Lars Lowe Sjoesund , Lisa Lee , Lucas Dixon , Machel Reid , Maciej Mikuła , Mateo Wirth , Michael Sharman , Nikolai Chinaev , Nithum Thain , Olivier Bachem , Oscar Chang , Oscar Wahltinez , Paige Bailey , Paul Michel , Petko Yotov , Rahma Chaabouni , Ramona Comanescu , Reena Jana , Rohan Anil , Ross McIlroy , Ruibo Liu , Ryan Mullins , Samuel L Smith , Sebastian Borgeaud , Sertan Girgin , Sholto Douglas , Shree Pandya , Siamak Shakeri , Soham De , Ted Klimenko , Tom Hennigan , Vlad Feinberg , Wojciech Stokowiec , Yu-hui Chen , Zafarali Ahmed , Zhitao Gong , Tris Warkentin , Ludovic Peran , Minh Giang , Clément Farabet , Oriol Vinyals , Jeff Dean , Koray Kavukcuoglu , Demis Hassabis , Zoubin Ghahramani , Douglas EckSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
- [1256] arXiv:2403.08299 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: AutoDev: Automated AI-Driven DevelopmentSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more. This enables the AI Agents to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. Furthermore, AutoDev establishes a secure development environment by confining all operations within Docker containers. This framework incorporates guardrails to ensure user privacy and file security, allowing users to define specific permitted or restricted commands and operations within AutoDev. In our evaluation, we tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.
- [1257] arXiv:2403.08309 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI FeedbackAng Li , Qiugen Xiao , Peng Cao , Jian Tang , Yi Yuan , Zijie Zhao , Xiaoyuan Chen , Liang Zhang , Xiangyang Li , Kaitong Yang , Weidong Guo , Yukang Gan , Xu Yu , Daniell Wang , Ying ShanComments: 18 pages, 7 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Reinforcement Learning from AI Feedback (RLAIF) has the advantages of shorter annotation cycles and lower costs over Reinforcement Learning from Human Feedback (RLHF), making it highly efficient during the rapid strategy iteration periods of large language model (LLM) training. Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human evaluators' preference win ratio for model responses, but a decrease in evaluators' satisfaction rate. Analysis suggests that the decrease in satisfaction rate is mainly due to some responses becoming less helpful, particularly in terms of correctness and truthfulness, highlighting practical limitations of basic RLAIF. In this paper, we propose Hybrid Reinforcement Learning from AI Feedback (HRLAIF). This method enhances the accuracy of AI annotations for responses, making the model's helpfulness more robust in training process. Additionally, it employs AI for Red Teaming, further improving the model's harmlessness. Human evaluation results show that HRLAIF inherits the ability of RLAIF to enhance human preference for outcomes at a low cost while also improving the satisfaction rate of responses. Compared to the policy model before Reinforcement Learning (RL), it achieves an increase of 2.08\% in satisfaction rate, effectively addressing the issue of a decrease of 4.58\% in satisfaction rate after basic RLAIF.
- [1258] arXiv:2403.08312 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal LossesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Standard Large Language Models (LLMs) struggle with handling dialogues with long contexts due to efficiency and consistency issues. According to our observation, dialogue contexts are highly structured, and the special token of \textit{End-of-Utterance} (EoU) in dialogues has the potential to aggregate information. We refer to the EoU tokens as ``conversational attention sinks'' (conv-attn sinks). Accordingly, we introduce StreamingDialogue, which compresses long dialogue history into conv-attn sinks with minimal losses, and thus reduces computational complexity quadratically with the number of sinks (i.e., the number of utterances). Current LLMs already demonstrate the ability to handle long context window, e.g., a window size of 200k or more. To this end, by compressing utterances into EoUs, our method has the potential to handle more than 200k of utterances, resulting in a prolonged dialogue learning. In order to minimize information losses from reconstruction after compression, we design two learning strategies of short-memory reconstruction (SMR) and long-memory reactivation (LMR). Our method outperforms strong baselines in dialogue tasks and achieves a 4 $\times$ speedup while reducing memory usage by 18 $\times$ compared to dense attention recomputation.
- [1259] arXiv:2403.08319 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Knowledge Conflicts for LLMs: A SurveySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.
- [1260] arXiv:2403.08332 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Autoregressive Score Generation for Multi-trait Essay ScoringComments: Accepted at EACL2024 FindingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score. However, studies have yet to explore these models in multi-trait AES, possibly due to the inefficiency of replicating BERT-based models for each trait. Breaking away from the existing sole use of encoder, we propose an autoregressive prediction of multi-trait scores (ArTS), incorporating a decoding process by leveraging the pre-trained T5. Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores. During decoding, the subsequent trait prediction can benefit by conditioning on the preceding trait scores. Experimental results proved the efficacy of ArTS, showing over 5% average improvements in both prompts and traits.
- [1261] arXiv:2403.08333 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Fast Inference of Removal-Based Node InfluenceComments: To be published in the Web Conference 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graph neural networks (GNNs) are widely utilized to capture the information spreading patterns in graphs. While remarkable performance has been achieved, there is a new trending topic of evaluating node influence. We propose a new method of evaluating node influence, which measures the prediction change of a trained GNN model caused by removing a node. A real-world application is, "In the task of predicting Twitter accounts' polarity, had a particular account been removed, how would others' polarity change?". We use the GNN as a surrogate model whose prediction could simulate the change of nodes or edges caused by node removal. Our target is to obtain the influence score for every node, and a straightforward way is to alternately remove every node and apply the trained GNN on the modified graph to generate new predictions. It is reliable but time-consuming, so we need an efficient method. The related lines of work, such as graph adversarial attack and counterfactual explanation, cannot directly satisfy our needs, since their problem settings are different. We propose an efficient, intuitive, and effective method, NOde-Removal-based fAst GNN inference (NORA), which uses the gradient information to approximate the node-removal influence. It only costs one forward propagation and one backpropagation to approximate the influence score for all nodes. Extensive experiments on six datasets and six GNN models verify the effectiveness of NORA. Our code is available at this https URL .
- [1262] arXiv:2403.08335 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Sparsity Principle for Partially Observable Causal Representation LearningDanru Xu , Dingling Yao , Sébastien Lachapelle , Perouz Taslakian , Julius von Kügelgen , Francesco Locatello , Sara MagliacaneComments: 33 pages, 18 figures, 9 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Causal representation learning aims at identifying high-level causal variables from perceptual data. Most methods assume that all latent causal variables are captured in the high-dimensional observations. We instead consider a partially observed setting, in which each measurement only provides information about a subset of the underlying causal state. Prior work has studied this setting with multiple domains or views, each depending on a fixed subset of latents. Here, we focus on learning from unpaired observations from a dataset with an instance-dependent partial observability pattern. Our main contribution is to establish two identifiability results for this setting: one for linear mixing functions without parametric assumptions on the underlying causal model, and one for piecewise linear mixing functions with Gaussian latent causal variables. Based on these insights, we propose two methods for estimating the underlying causal variables by enforcing sparsity in the inferred representation. Experiments on different simulated datasets and established benchmarks highlight the effectiveness of our approach in recovering the ground-truth latents.
- [1263] arXiv:2403.08337 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: LLM-Assisted Light: Leveraging Large Language Model Capabilities for Human-Mimetic Traffic Signal Control in Complex Urban EnvironmentsComments: 15 pagesSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Traffic congestion in metropolitan areas presents a formidable challenge with far-reaching economic, environmental, and societal ramifications. Therefore, effective congestion management is imperative, with traffic signal control (TSC) systems being pivotal in this endeavor. Conventional TSC systems, designed upon rule-based algorithms or reinforcement learning (RL), frequently exhibit deficiencies in managing the complexities and variabilities of urban traffic flows, constrained by their limited capacity for adaptation to unfamiliar scenarios. In response to these limitations, this work introduces an innovative approach that integrates Large Language Models (LLMs) into TSC, harnessing their advanced reasoning and decision-making faculties. Specifically, a hybrid framework that augments LLMs with a suite of perception and decision-making tools is proposed, facilitating the interrogation of both the static and dynamic traffic information. This design places the LLM at the center of the decision-making process, combining external traffic data with established TSC methods. Moreover, a simulation platform is developed to corroborate the efficacy of the proposed framework. The findings from our simulations attest to the system's adeptness in adjusting to a multiplicity of traffic environments without the need for additional training. Notably, in cases of Sensor Outage (SO), our approach surpasses conventional RL-based systems by reducing the average waiting time by $20.4\%$. This research signifies a notable advance in TSC strategies and paves the way for the integration of LLMs into real-world, dynamic scenarios, highlighting their potential to revolutionize traffic management. The related code is available at \href{ this https URL }{ this https URL }.
- [1264] arXiv:2403.08352 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methodsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Abstract: Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial and error procedures for creating and testing different candidate augmentations and their hyperparameters manually. Automated data augmentation methods aim to automate the process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carried out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.
- [1265] arXiv:2403.08364 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Decoupled Federated Learning on Long-Tailed and Non-IID data with Feature StatisticsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning is designed to enhance data security and privacy, but faces challenges when dealing with heterogeneous data in long-tailed and non-IID distributions. This paper explores an overlooked scenario where tail classes are sparsely distributed over a few clients, causing the models trained with these classes to have a lower probability of being selected during client aggregation, leading to slower convergence rates and poorer model performance. To address this issue, we propose a two-stage Decoupled Federated learning framework using Feature Statistics (DFL-FS). In the first stage, the server estimates the client's class coverage distributions through masked local feature statistics clustering to select models for aggregation to accelerate convergence and enhance feature learning without privacy leakage. In the second stage, DFL-FS employs federated feature regeneration based on global feature statistics and utilizes resampling and weighted covariance to calibrate the global classifier to enhance the model's adaptability to long-tailed data distributions. We conducted experiments on CIFAR10-LT and CIFAR100-LT datasets with various long-tailed rates. The results demonstrate that our method outperforms state-of-the-art methods in both accuracy and convergence rate.
- [1266] arXiv:2403.08370 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SMART: Submodular Data Mixture Strategy for Instruction TuningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Instruction Tuning involves finetuning a language model on a collection of instruction-formatted datasets in order to enhance the generalizability of the model to unseen tasks. Studies have shown the importance of balancing different task proportions during finetuning, but finding the right balance remains challenging. Unfortunately, there's currently no systematic method beyond manual tuning or relying on practitioners' intuition. In this paper, we introduce SMART (Submodular data Mixture strAtegy for instRuction Tuning) - a novel data mixture strategy which makes use of a submodular function to assign importance scores to tasks which are then used to determine the mixture weights. Given a fine-tuning budget, SMART redistributes the budget among tasks and selects non-redundant samples from each task. Experimental results demonstrate that SMART significantly outperforms traditional methods such as examples proportional mixing and equal mixing. Furthermore, SMART facilitates the creation of data mixtures based on a few representative subsets of tasks alone and through task pruning analysis, we reveal that in a limited budget setting, allocating budget among a subset of representative tasks yields superior performance compared to distributing the budget among all tasks. The code for reproducing our results is open-sourced at this https URL .
- [1267] arXiv:2403.08375 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: Translating between SQL Dialects for Cloud MigrationSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Abstract: Migrations of systems from on-site premises to the cloud has been a fundamental endeavor by many industrial institutions. A crucial component of such cloud migrations is the transition of databases to be hosted online. In this work, we consider the difficulties of this migration for SQL databases. While SQL is one of the prominent methods for storing database procedures, there are a plethora of different SQL dialects (e.g., MySQL, Postgres, etc.) which can complicate migrations when the on-premise SQL dialect differs to the dialect hosted on the cloud. Tools exist by common cloud provides such as AWS and Azure to aid in translating between dialects in order to mitigate the majority of the difficulties. However, these tools do not successfully translate $100\%$ of the code. Consequently, software engineers must manually convert the remainder of the untranslated database. For large organizations, this task quickly becomes intractable and so more innovative solutions are required. We consider this challenge a novel yet vital industrial research problem for any large corporation that is considering cloud migrations. Furthermore, we introduce potential avenues of research to tackle this challenge that have yielded promising preliminary results.
- [1268] arXiv:2403.08414 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Causal Graph Neural Networks for Wildfire Danger PredictionShan Zhao , Ioannis Prapas , Ilektra Karasante , Zhitong Xiong , Ioannis Papoutsis , Gustau Camps-Valls , Xiao Xiang ZhuComments: Accepted by ICLR 2024 Machine Learning for Remote Sensing (ML4RS) WorkshopSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Wildfire forecasting is notoriously hard due to the complex interplay of different factors such as weather conditions, vegetation types and human activities. Deep learning models show promise in dealing with this complexity by learning directly from data. However, to inform critical decision making, we argue that we need models that are right for the right reasons; that is, the implicit rules learned should be grounded by the underlying processes driving wildfires. In that direction, we propose integrating causality with Graph Neural Networks (GNNs) that explicitly model the causal mechanism among complex variables via graph learning. The causal adjacency matrix considers the synergistic effect among variables and removes the spurious links from highly correlated impacts. Our methodology's effectiveness is demonstrated through superior performance forecasting wildfire patterns in the European boreal and mediterranean biome. The gain is especially prominent in a highly imbalanced dataset, showcasing an enhanced robustness of the model to adapt to regime shifts in functional relationships. Furthermore, SHAP values from our trained model further enhance our understanding of the model's inner workings.
- [1269] arXiv:2403.08424 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Tastle: Distract Large Language Models for Automatic Jailbreak AttackSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) have achieved significant advances in recent days. Extensive efforts have been made before the public release of LLMs to align their behaviors with human values. The primary goal of alignment is to ensure their helpfulness, honesty and harmlessness. However, even meticulously aligned LLMs remain vulnerable to malicious manipulations such as jailbreaking, leading to unintended behaviors. The jailbreak is to intentionally develop a malicious prompt that escapes from the LLM security restrictions to produce uncensored detrimental contents. Previous works explore different jailbreak methods for red teaming LLMs, yet they encounter challenges regarding to effectiveness and scalability. In this work, we propose Tastle, a novel black-box jailbreak framework for automated red teaming of LLMs. We designed malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs, motivated by the research about the distractibility and over-confidence phenomenon of LLMs. Extensive experiments of jailbreaking both open-source and proprietary LLMs demonstrate the superiority of our framework in terms of effectiveness, scalability and transferability. We also evaluate the effectiveness of existing jailbreak defense methods against our attack and highlight the crucial need to develop more effective and practical defense strategies.
- [1270] arXiv:2403.08426 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Language-Driven Visual Consensus for Zero-Shot Semantic SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite its effectiveness, prevailing methods within this paradigm encounter challenges, including overfitting on seen classes and small fragmentation in masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.Specifically, we leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings. Moreover, to circumvent noisy alignments from the vision part due to its redundant nature, we introduce route attention into self-attention for finding visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly boosts the generalization capacity of segmentation models for unseen classes. Experimental results underscore the effectiveness of our approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.
- [1271] arXiv:2403.08429 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Software Vulnerability and Functionality Assessment using LLMsComments: 4 pages, accepted to NLBSE'24Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: While code review is central to the software development process, it can be tedious and expensive to carry out. In this paper, we investigate whether and how Large Language Models (LLMs) can aid with code reviews. Our investigation focuses on two tasks that we argue are fundamental to good reviews: (i) flagging code with security vulnerabilities and (ii) performing software functionality validation, i.e., ensuring that code meets its intended functionality. To test performance on both tasks, we use zero-shot and chain-of-thought prompting to obtain final ``approve or reject'' recommendations. As data, we employ seminal code generation datasets (HumanEval and MBPP) along with expert-written code snippets with security vulnerabilities from the Common Weakness Enumeration (CWE). Our experiments consider a mixture of three proprietary models from OpenAI and smaller open-source LLMs. We find that the former outperforms the latter by a large margin. Motivated by promising results, we finally ask our models to provide detailed descriptions of security vulnerabilities. Results show that 36.7% of LLM-generated descriptions can be associated with true CWE vulnerabilities.
- [1272] arXiv:2403.08430 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Search-based Optimisation of LLM Learning Shots for Story Point EstimationComments: 6 pages, Accepted at SSBSE'23 NIER TrackJournal-ref: Search-Based Software Engineering. SSBSE 2023. Lecture Notes in Computer Science, vol 14415. SpringerSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: One of the ways Large Language Models (LLMs) are used to perform machine learning tasks is to provide them with a few examples before asking them to produce a prediction. This is a meta-learning process known as few-shot learning. In this paper, we use available Search-Based methods to optimise the number and combination of examples that can improve an LLM's estimation performance, when it is used to estimate story points for new agile tasks. Our preliminary results show that our SBSE technique improves the estimation performance of the LLM by 59.34% on average (in terms of mean absolute error of the estimation) over three datasets against a zero-shot setting.
- [1273] arXiv:2403.08438 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Reproducibility and Geometric Intrinsic Dimensionality: An Investigation on Graph Neural Network ResearchComments: 39 pages, 9 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Difficulties in replication and reproducibility of empirical evidences in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable requires reproducibility, which verifies the reliability of research findings using the same code and data. This promotes open and accessible research, robust experimental workflows, and the rapid integration of new findings. Evaluating the degree to which research publications support these different aspects of reproducibility is one goal of the present work. For this we introduce an ontology of reproducibility in machine learning and apply it to methods for graph neural networks. Building on these efforts we turn towards another critical challenge in machine learning, namely the curse of dimensionality, which poses challenges in data collection, representation, and analysis, making it harder to find representative data and impeding the training and inference processes. Using the closely linked concept of geometric intrinsic dimension we investigate to which extend the used machine learning models are influenced by the intrinsic dimension of the data sets they are trained on.
- [1274] arXiv:2403.08502 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Masked Generative Story Transformer with Character Guidance and Caption AugmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Story Visualization (SV) is a challenging generative vision task, that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately, to improve the rendering of characters. On the contrary, we embrace a completely parallel transformer-based approach, exclusively relying on Cross-Attention with past and future captions to achieve consistency. Additionally, we propose a Character Guidance technique to focus on the generation of characters in an implicit manner, by forming a combination of text-conditional and character-conditional logits in the logit space. We also employ a caption-augmentation technique, carried out by a Large Language Model (LLM), to enhance the robustness of our approach. The combination of these methods culminates into state-of-the-art (SOTA) results over various metrics in the most prominent SV benchmark (Pororo-SV), attained with constraint resources while achieving superior computational complexity compared to previous arts. The validity of our quantitative results is supported by a human survey.
- [1275] arXiv:2403.08505 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Content-aware Masked Image Modeling Transformer for Stereo Image CompressionXinjie Zhang , Shenyuan Gao , Zhening Liu , Jiawei Shao , Xingtong Ge , Dailan He , Tongda Xu , Yan Wang , Jun ZhangSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract: Existing learning-based stereo image codec adopt sophisticated transformation with simple entropy models derived from single image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image to latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies, by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets Cityscapes and InStereo2K with fast encoding and decoding speed.
- [1276] arXiv:2403.08506 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated LearningJournal-ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the number of domains, which must strictly match the number of clients. Because of the underutilization of numerous edge devices and additional cross-client domain annotations in the real world, such restrictions may be impractical and involve potential privacy leaks. In this paper, we propose an efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a method that tackles the above restrictions by learning adaptive prompts for domain generalization in a distributed manner. Specifically, we first design two types of prompts, i.e., global prompt to capture general knowledge across all clients and domain prompts to capture domain-specific knowledge. They eliminate the restriction on the one-to-one mapping between source domains and local clients. Furthermore, a dynamic query metric is introduced to automatically search the suitable domain label for each sample, which includes two-substep text-image alignments based on prompt tuning without labor-intensive annotation. Extensive experiments on multiple datasets demonstrate that our DiPrompT achieves superior domain generalization performance over state-of-the-art FL methods when domain labels are not provided, and even outperforms many centralized learning methods using domain labels.
- [1277] arXiv:2403.08528 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Pig aggression classification using CNN, Transformers and Recurrent NetworksJunior Silva Souza , Eduardo Bedin , Gabriel Toshio Hirokawa Higa , Newton Loebens , Hemerson PistoriSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The development of techniques that can be used to analyze and detect animal behavior is a crucial activity for the livestock sector, as it is possible to monitor the stress and animal welfare and contributes to decision making in the farm. Thus, the development of applications can assist breeders in making decisions to improve production performance and reduce costs, once the animal behavior is analyzed by humans and this can lead to susceptible errors and time consumption. Aggressiveness in pigs is an example of behavior that is studied to reduce its impact through animal classification and identification. However, this process is laborious and susceptible to errors, which can be reduced through automation by visually classifying videos captured in controlled environment. The captured videos can be used for training and, as a result, for classification through computer vision and artificial intelligence, employing neural network techniques. The main techniques utilized in this study are variants of transformers: STAM, TimeSformer, and ViViT, as well as techniques using convolutions, such as ResNet3D2, Resnet(2+1)D, and CnnLstm. These techniques were employed for pig video classification with the objective of identifying aggressive and non-aggressive behaviors. In this work, various techniques were compared to analyze the contribution of using transformers, in addition to the effectiveness of the convolution technique in video classification. The performance was evaluated using accuracy, precision, and recall. The TimerSformer technique showed the best results in video classification, with median accuracy of 0.729.
- [1278] arXiv:2403.08536 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HOLMES: HOLonym-MEronym based Semantic inspection for Convolutional Image ClassifiersComments: This work has been accepted to be presented to The 1st World Conference on eXplainable Artificial Intelligence (xAI 2023), July 26-28, 2023 - Lisboa, PortugalJournal-ref: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1902. Springer, ChamSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Convolutional Neural Networks (CNNs) are nowadays the model of choice in Computer Vision, thanks to their ability to automatize the feature extraction process in visual tasks. However, the knowledge acquired during training is fully subsymbolic, and hence difficult to understand and explain to end users. In this paper, we propose a new technique called HOLMES (HOLonym-MEronym based Semantic inspection) that decomposes a label into a set of related concepts, and provides component-level explanations for an image classification model. Specifically, HOLMES leverages ontologies, web scraping and transfer learning to automatically construct meronym (parts)-based detectors for a given holonym (class). Then, it produces heatmaps at the meronym level and finally, by probing the holonym CNN with occluded images, it highlights the importance of each part on the classification output. Compared to state-of-the-art saliency methods, HOLMES takes a step further and provides information about both where and what the holonym CNN is looking at, without relying on densely annotated datasets and without forcing concepts to be associated to single computational units. Extensive experimental evaluation on different categories of objects (animals, tools and vehicles) shows the feasibility of our approach. On average, HOLMES explanations include at least two meronyms, and the ablation of a single meronym roughly halves the holonym model confidence. The resulting heatmaps were quantitatively evaluated using the deletion/insertion/preservation curves. All metrics were comparable to those achieved by GradCAM, while offering the advantage of further decomposing the heatmap in human-understandable concepts, thus highlighting both the relevance of meronyms to object classification, as well as HOLMES ability to capture it. The code is available at this https URL .
- [1279] arXiv:2403.08551 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian SplattingXinjie Zhang , Xingtong Ge , Tongda Xu , Dailan He , Yan Wang , Hongwei Qin , Guo Lu , Jing Geng , Jun ZhangSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract: Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds with 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussian to represent the image, where each Gaussian has 8 parameters including position, covariance and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method with a minimum of 3$\times$ lower GPU memory usage and 5$\times$ faster fitting time not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 1000 FPS. Additionally, preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding. Code will be available at this https URL .
- [1280] arXiv:2403.08554 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Federated Knowledge Graph Unlearning via Diffusion ModelSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Federated learning (FL) promotes the development and application of artificial intelligence technologies by enabling model sharing and collaboration while safeguarding data privacy. Knowledge graph (KG) embedding representation provides a foundation for knowledge reasoning and applications by mapping entities and relations into vector space. Federated KG embedding enables the utilization of knowledge from diverse client sources while safeguarding the privacy of local data. However, due to demands such as privacy protection and the need to adapt to dynamic data changes, investigations into machine unlearning (MU) have been sparked. However, it is challenging to maintain the performance of KG embedding models while forgetting the influence of specific forgotten data on the model. In this paper, we propose FedDM, a novel framework tailored for machine unlearning in federated knowledge graphs. Leveraging diffusion models, we generate noisy data to sensibly mitigate the influence of specific knowledge on FL models while preserving the overall performance concerning the remaining data. We conduct experimental evaluations on benchmark datasets to assess the efficacy of the proposed model. Extensive experiments demonstrate that FedDM yields promising results in knowledge forgetting.
- [1281] arXiv:2403.08556 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One ModelComments: Project Page: this http URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The generalization of monocular metric depth estimation (MMDE) has been a longstanding challenge. Recent methods made progress by combining relative and metric depth or aligning input image focal length. However, they are still beset by challenges in camera, scene, and data levels: (1) Sensitivity to different cameras; (2) Inconsistent accuracy across scenes; (3) Reliance on massive training data. This paper proposes SM4Depth, a seamless MMDE method, to address all the issues above within a single network. First, we reveal that a consistent field of view (FOV) is the key to resolve ``metric ambiguity'' across cameras, which guides us to propose a more straightforward preprocessing unit. Second, to achieve consistently high accuracy across scenes, we explicitly model the metric scale determination as discretizing the depth interval into bins and propose variation-based unnormalized depth bins. This method bridges the depth gap of diverse scenes by reducing the ambiguity of the conventional metric bin. Third, to reduce the reliance on massive training data, we propose a ``divide and conquer" solution. Instead of estimating directly from the vast solution space, the correct metric bins are estimated from multiple solution sub-spaces for complexity reduction. Finally, with just 150K RGB-D pairs and a consumer-grade GPU for training, SM4Depth achieves state-of-the-art performance on most previously unseen datasets, especially surpassing ZoeDepth and Metric3D on mRI$_\theta$. The code can be found at this https URL .
- [1282] arXiv:2403.08562 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Structural perspective on constraint-based learning of Markov networksComments: AISTATS 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Abstract: Markov networks are probabilistic graphical models that employ undirected graphs to depict conditional independence relationships among variables. Our focus lies in constraint-based structure learning, which entails learning the undirected graph from data through the execution of conditional independence tests. We establish theoretical limits concerning two critical aspects of constraint-based learning of Markov networks: the number of tests and the sizes of the conditioning sets. These bounds uncover an exciting interplay between the structural properties of the graph and the amount of tests required to learn a Markov network. The starting point of our work is that the graph parameter maximum pairwise connectivity, $\kappa$, that is, the maximum number of vertex-disjoint paths connecting a pair of vertices in the graph, is responsible for the sizes of independence tests required to learn the graph. On one hand, we show that at least one test with the size of the conditioning set at least $\kappa$ is always necessary. On the other hand, we prove that any graph can be learned by performing tests of size at most $\kappa$. This completely resolves the question of the minimum size of conditioning sets required to learn the graph. When it comes to the number of tests, our upper bound on the sizes of conditioning sets implies that every $n$-vertex graph can be learned by at most $n^{\kappa}$ tests with conditioning sets of sizes at most $\kappa$. We show that for any upper bound $q$ on the sizes of the conditioning sets, there exist graphs with $O(n q)$ vertices that require at least $n^{\Omega(\kappa)}$ tests to learn. This lower bound holds even when the treewidth and the maximum degree of the graph are at most $\kappa+2$. On the positive side, we prove that every graph of bounded treewidth can be learned by a polynomial number of tests with conditioning sets of sizes at most $2\kappa$.
- [1283] arXiv:2403.08564 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Non-discrimination Criteria for Generative Language ModelsComments: 14 pages, 5 figures. Submitted to ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Within recent years, generative AI, such as large language models, has undergone rapid development. As these models become increasingly available to the public, concerns arise about perpetuating and amplifying harmful biases in applications. Gender stereotypes can be harmful and limiting for the individuals they target, whether they consist of misrepresentation or discrimination. Recognizing gender bias as a pervasive societal construct, this paper studies how to uncover and quantify the presence of gender biases in generative language models. In particular, we derive generative AI analogues of three well-known non-discrimination criteria from classification, namely independence, separation and sufficiency. To demonstrate these criteria in action, we design prompts for each of the criteria with a focus on occupational gender stereotype, specifically utilizing the medical test to introduce the ground truth in the generative AI context. Our results address the presence of occupational gender bias within such conversational language models.
- [1284] arXiv:2403.08593 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured EnvironmentsSitao Cheng , Ziyuan Zhuang , Yong Xu , Fangkai Yang , Chaoyun Zhang , Xiaoting Qin , Xiang Huang , Ling Chen , Qingwei Lin , Dongmei Zhang , Saravan Rajmohan , Qi ZhangComments: 17 pages, 8 figures, 9 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have shown potential in reasoning over structured environments, e.g., knowledge graph and table. Such tasks typically require multi-hop reasoning, i.e., match natural language utterance with instances in the environment. Previous methods leverage LLMs to incrementally build a reasoning path, where the LLMs either invoke tools or pick up schemas by step-by-step interacting with the environment. We propose Reasoning-Path-Editing (Readi), a novel framework where LLMs can efficiently and faithfully reason over structured environments. In Readi, LLMs initially generate a reasoning path given a query, and edit the path only when necessary. We instantiate the path on structured environments and provide feedback to edit the path if anything goes wrong. Experimental results on three KGQA datasets and two TableQA datasets show the effectiveness of Readi, significantly surpassing all LLM-based methods (by 9.1% on WebQSP, 12.4% on MQA-3H and 10.9% on WTQ), comparable with state-of-the-art fine-tuned methods (67% on CWQ and 74.7% on WebQSP) and substantially boosting the vanilla LLMs (by 14.9% on CWQ). Our code will be available upon publication.
- [1285] arXiv:2403.08607 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MedInsight: A Multi-Source Context Augmentation Framework for Generating Patient-Centric Medical Responses using Large Language ModelsSubash Neupane , Shaswata Mitra , Sudip Mittal , Noorbakhsh Amiri Golilarz , Shahram Rahimi , Amin AmirlatifiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have shown impressive capabilities in generating human-like responses. However, their lack of domain-specific knowledge limits their applicability in healthcare settings, where contextual and comprehensive responses are vital. To address this challenge and enable the generation of patient-centric responses that are contextually relevant and comprehensive, we propose MedInsight:a novel retrieval augmented framework that augments LLM inputs (prompts) with relevant background information from multiple sources. MedInsight extracts pertinent details from the patient's medical record or consultation transcript. It then integrates information from authoritative medical textbooks and curated web resources based on the patient's health history and condition. By constructing an augmented context combining the patient's record with relevant medical knowledge, MedInsight generates enriched, patient-specific responses tailored for healthcare applications such as diagnosis, treatment recommendations, or patient education. Experiments on the MTSamples dataset validate MedInsight's effectiveness in generating contextually appropriate medical responses. Quantitative evaluation using the Ragas metric and TruLens for answer similarity and answer correctness demonstrates the model's efficacy. Furthermore, human evaluation studies involving Subject Matter Expert (SMEs) confirm MedInsight's utility, with moderate inter-rater agreement on the relevance and correctness of the generated responses.
- [1286] arXiv:2403.08613 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Link Prediction for Social Networks using Representation Learning and Heuristic-based FeaturesComments: Accepted to the MAISoN Workshop at IJCAI 2023Subjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The exponential growth in scale and relevance of social networks enable them to provide expansive insights. Predicting missing links in social networks efficiently can help in various modern-day business applications ranging from generating recommendations to influence analysis. Several categories of solutions exist for the same. Here, we explore various feature extraction techniques to generate representations of nodes and edges in a social network that allow us to predict missing links. We compare the results of using ten feature extraction techniques categorized across Structural embeddings, Neighborhood-based embeddings, Graph Neural Networks, and Graph Heuristics, followed by modeling with ensemble classifiers and custom Neural Networks. Further, we propose combining heuristic-based features and learned representations that demonstrate improved performance for the link prediction task on social network datasets. Using this method to generate accurate recommendations for many applications is a matter of further study that appears very promising. The code for all the experiments has been made public.
- [1287] arXiv:2403.08618 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Verifix: Post-Training Correction to Improve Label Noise Robustness with Verified SamplesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Label corruption, where training samples have incorrect labels, can significantly degrade the performance of machine learning models. This corruption often arises from non-expert labeling or adversarial attacks. Acquiring large, perfectly labeled datasets is costly, and retraining large models from scratch when a clean dataset becomes available is computationally expensive. To address this challenge, we propose Post-Training Correction, a new paradigm that adjusts model parameters after initial training to mitigate label noise, eliminating the need for retraining. We introduce Verifix, a novel Singular Value Decomposition (SVD) based algorithm that leverages a small, verified dataset to correct the model weights using a single update. Verifix uses SVD to estimate a Clean Activation Space and then projects the model's weights onto this space to suppress activations corresponding to corrupted data. We demonstrate Verifix's effectiveness on both synthetic and real-world label noise. Experiments on the CIFAR dataset with 25% synthetic corruption show 7.36% generalization improvements on average. Additionally, we observe generalization improvements of up to 2.63% on naturally corrupted datasets like WebVision1.0 and Clothing1M.
- [1288] arXiv:2403.08635 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Human Alignment of Large Language Models through Online Preference OptimisationDaniele Calandriello , Daniel Guo , Remi Munos , Mark Rowland , Yunhao Tang , Bernardo Avila Pires , Pierre Harvey Richemond , Charline Le Lan , Michal Valko , Tianqi Liu , Rishabh Joshi , Zeyu Zheng , Bilal PiotSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD.
This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task. - [1289] arXiv:2403.08688 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Token Alignment via Character Matching for Subword CompletionBen Athiwaratkun , Shiqi Wang , Mingyue Shang , Yuchen Tian , Zijian Wang , Sujan Kumar Gonugondla , Sanjay Krishna Gouda , Rob Kwiatowski , Ramesh Nallapati , Bing XiangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate the tokenization artifact on text completion in generative models, maintaining performance even in regular non-subword cases. The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model's generation aligns with the prompt. This approach showcases marked improvement across many partial token scenarios, including nuanced cases like space-prefix and partial indentation, with only a minor time increase. The technique and analysis detailed in this paper contribute to the continuous advancement of generative models in handling partial inputs, bearing relevance for applications like code completion and text autocompletion.
- [1290] arXiv:2403.08699 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Implicit Regularization of Gradient Flow on One-Layer Softmax AttentionComments: 34 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract: We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such implicit regularization can be described by a Support Vector Machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results showing that the gradient descent induces an implicit regularization on the Frobenius norm on the product weight matrix when the key and query matrices are combined into a single weight matrix for training. For diagonal key and query matrices, our analysis builds upon the reparameterization technique and exploits approximate KKT conditions of the SVM associated with the classification data. Moreover, the results are extended to general weights configurations given proper alignment of the weight matrices' singular spaces with the data features at initialization.
- [1291] arXiv:2403.08728 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted DataAsad Aali , Giannis Daras , Brett Levac , Sidharth Kumar , Alexandros G. Dimakis , Jonathan I. TamirComments: Pre-print, work in progressSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. Our method, Ambient Diffusion Posterior Sampling (A-DPS), leverages a generative model pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling conditioned on measurements from a potentially different forward process (e.g. image blurring). We test the efficacy of our approach on standard natural image datasets (CelebA, FFHQ, and AFHQ) and we show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance. We further extend the Ambient Diffusion framework to train MRI models with access only to Fourier subsampled multi-coil MRI measurements at various acceleration factors (R=2, 4, 6, 8). We again observe that models trained on highly subsampled data are better priors for solving inverse problems in the high acceleration regime than models trained on fully sampled data. We open-source our code and the trained Ambient Diffusion MRI models: this https URL .
- [1292] arXiv:2403.08739 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language ModelsComments: 15 pagesSubjects: Computation and Language (cs.CL) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
Abstract: A substantial gap persists in understanding the reasons behind the exceptional performance of the Transformer architecture in NLP. A particularly unexplored area involves the mechanistic description of how the distribution of parameters evolves over time during training. In this work we suggest that looking at the time evolution of the statistic distribution of model parameters, and specifically at bifurcation effects, can help understanding the model quality, potentially reducing training costs and evaluation efforts and empirically showing the reasons behind the effectiveness of weights sparsification.
- [1293] arXiv:2403.08743 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing FrameworkComments: 18 pages, 11 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) can easily generate biased and discriminative responses. As LLMs tap into consequential decision-making (e.g., hiring and healthcare), it is of crucial importance to develop strategies to mitigate these biases. This paper focuses on social bias, tackling the association between demographic information and LLM outputs. We propose a causality-guided debiasing framework that utilizes causal understandings of (1) the data-generating process of the training corpus fed to LLMs, and (2) the internal reasoning process of LLM inference, to guide the design of prompts for debiasing LLM outputs through selection mechanisms. Our framework unifies existing de-biasing prompting approaches such as inhibitive instructions and in-context contrastive examples, and sheds light on new ways of debiasing by encouraging bias-free reasoning. Our strong empirical performance on real-world datasets demonstrates that our framework provides principled guidelines on debiasing LLM outputs even with only the black-box access.
- [1294] arXiv:2403.08755 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DAM: Dynamic Adapter Merging for Continual Video QA LearningComments: The first two authors contribute equallySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Given a set of continually streaming VidQA datasets, we sequentially train dataset-specific adapters for each dataset while freezing the parameters of a large pretrained video-language backbone. During inference, given a video-question sample from an unknown domain, our method first uses the proposed non-parametric router function to compute a probability for each adapter, reflecting how relevant that adapter is to the current video-question input instance. Subsequently, the proposed dynamic adapter merging scheme aggregates all the adapter weights into a new adapter instance tailored for that particular test sample to compute the final VidQA prediction, mitigating the impact of inaccurate router predictions and facilitating knowledge sharing across domains. Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains. We further extend DAM to continual image classification and image QA and outperform prior methods by a large margin. The code is publicly available at: this https URL
- [1295] arXiv:2403.08763 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Simple and Scalable Strategies to Continually Pre-train Large Language ModelsAdam Ibrahim , Benjamin Thérien , Kshitij Gupta , Mats L. Richter , Quentin Anthony , Timothée Lesort , Eugene Belilovsky , Irina RishSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
- [1296] arXiv:2403.08770 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FastMAC: Stochastic Spectral Sampling of Correspondence GraphComments: CVPR 2024, Code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: 3D correspondence, i.e., a pair of 3D points, is a fundamental concept in computer vision. A set of 3D correspondences, when equipped with compatibility edges, forms a correspondence graph. This graph is a critical component in several state-of-the-art 3D point cloud registration approaches, e.g., the one based on maximal cliques (MAC). However, its properties have not been well understood. So we present the first study that introduces graph signal processing into the domain of correspondence graph. We exploit the generalized degree signal on correspondence graph and pursue sampling strategies that preserve high-frequency components of this signal. To address time-consuming singular value decomposition in deterministic sampling, we resort to a stochastic approximate sampling strategy. As such, the core of our method is the stochastic spectral sampling of correspondence graph. As an application, we build a complete 3D registration algorithm termed as FastMAC, that reaches real-time speed while leading to little to none performance drop. Through extensive experiments, we validate that FastMAC works for both indoor and outdoor benchmarks. For example, FastMAC can accelerate MAC by 80 times while maintaining high registration success rate on KITTI. Codes are publicly available at this https URL .
- [1297] arXiv:2403.08773 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Veagle: Advancements in Multimodal Representation LearningRajat Chawla , Arkajit Datta , Tushar Verma , Adarsh Jha , Anmol Gautam , Ayush Vatsal , Sukrit Chaterjee , Mukunda NS , Ishaan BholaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract: Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information. Multimodal models, an extension of Large Language Models (LLMs), have exhibited remarkable capabilities in addressing a diverse array of tasks, ranging from image captioning and visual question answering (VQA) to visual grounding. While these models have showcased significant advancements, challenges persist in accurately interpreting images and answering the question, a common occurrence in real-world scenarios. This paper introduces a novel approach to enhance the multimodal capabilities of existing models. In response to the limitations observed in current Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs), our proposed model Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Veagle leverages a dynamic mechanism to project encoded visual information directly into the language model. This dynamic approach allows for a more nuanced understanding of intricate details present in visual contexts. To validate the effectiveness of Veagle, we conduct comprehensive experiments on benchmark datasets, emphasizing tasks such as visual question answering and image understanding. Our results indicate a improvement of 5-6 \% in performance, with Veagle outperforming existing models by a notable margin. The outcomes underscore the model's versatility and applicability beyond traditional benchmarks.
- [1298] arXiv:2403.08774 (cross-list from physics.soc-ph) [ pdf , ps , other ]
-
Title: Discussion of Loop Expansion and Introduction of Series Cutting Functions to Local Potential Approximation: Complexity Analysis Using Green's Functions, Cutting Of Nth-Order Social Interactions For Progressive SafetyComments: In this study, we focus on the aforementioned paper, "Examination Kubo-Matsubara Green's Function Of The Edwards-Anderson Model: Extreme Value Information Flow Of Nth-Order Interpolated Extrapolation Of Zero Phenomena Using The Replica Method (2024)"Subjects: Physics and Society (physics.soc-ph) ; Artificial Intelligence (cs.AI)
Abstract: In this study, we focus on the aforementioned paper, "Examination Kubo-Matsubara Green's Function Of The Edwards-Anderson Model: Extreme Value Information Flow Of Nth-Order Interpolated Extrapolation Of Zero Phenomena Using The Replica Method (2024)". This paper also applies theoretical physics methods to better understand the filter bubble phenomenon, focusing in particular on loop expansions and truncation functions. Using the loop expansion method, the complexity of social interactions during the occurrence of filter bubbles will be discussed in order to introduce series, express mathematically, and evaluate the impact of these interactions. We analyze the interactions between agents and their time evolution using a variety of Green's functions, including delayed Green's functions, advanced Green's functions, and causal Green's functions, to capture the dynamic response of the system through local potential approximations. In addition, we apply truncation functions and truncation techniques to ensure incremental safety and evaluate the long-term stability of the system. This approach will enable a better understanding of the mechanisms of filter bubble generation and dissolution, and discuss insights into their prevention and management. This research explores the possibilities of applying theoretical physics frameworks to social science problems and examines methods for analyzing the complex dynamics of information flow and opinion formation in digital society.This paper is partially an attempt to utilize "Generative AI" and was written with educational intent. There are currently no plans for it to become a peer-reviewed paper.
- [1299] arXiv:2403.08775 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Constrained Reinforcement Learning for Adaptive Controller Synchronization in Distributed SDNSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: In software-defined networking (SDN), the implementation of distributed SDN controllers, with each controller responsible for managing a specific sub-network or domain, plays a critical role in achieving a balance between centralized control, scalability, reliability, and network efficiency. These controllers must be synchronized to maintain a logically centralized view of the entire network. While there are various approaches for synchronizing distributed SDN controllers, most tend to prioritize goals such as optimization of communication latency or load balancing, often neglecting to address both the aspects simultaneously. This limitation becomes particularly significant when considering applications like Augmented and Virtual Reality (AR/VR), which demand constrained network latencies and substantial computational resources. Additionally, many existing studies in this field predominantly rely on value-based reinforcement learning (RL) methods, overlooking the potential advantages offered by state-of-the-art policy-based RL algorithms. To bridge this gap, our work focuses on examining deep reinforcement learning (DRL) techniques, encompassing both value-based and policy-based methods, to guarantee an upper latency threshold for AR/VR task offloading within SDN environments, while selecting the most cost-effective servers for AR/VR task offloading. Our evaluation results indicate that while value-based methods excel in optimizing individual network metrics such as latency or load balancing, policy-based approaches exhibit greater robustness in adapting to sudden network changes or reconfiguration.
- [1300] arXiv:2403.08776 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context DetectionComments: 13 pages, 6 figures , conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Out-of-context (OOC) detection is a challenging task involving identifying images and texts that are irrelevant to the context in which they are presented. Large vision-language models (LVLMs) are effective at various tasks, including image classification and text generation. However, the extent of their proficiency in multimodal OOC detection tasks is unclear. In this paper, we investigate the ability of LVLMs to detect multimodal OOC and show that these models cannot achieve high accuracy on OOC detection tasks without fine-tuning. However, we demonstrate that fine-tuning LVLMs on multimodal OOC datasets can further improve their OOC detection accuracy. To evaluate the performance of LVLMs on OOC detection tasks, we fine-tune MiniGPT-4 on the NewsCLIPpings dataset, a large dataset of multimodal OOC. Our results show that fine-tuning MiniGPT-4 on the NewsCLIPpings dataset significantly improves the OOC detection accuracy in this dataset. This suggests that fine-tuning can significantly improve the performance of LVLMs on OOC detection tasks.
- [1301] arXiv:2403.08782 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Procedural terrain generation with style transferSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this study we introduce a new technique for the generation of terrain maps, exploiting a combination of procedural generation and Neural Style Transfer. We consider our approach to be a viable alternative to competing generative models, with our technique achieving greater versatility, lower hardware requirements and greater integration in the creative process of designers and developers. Our method involves generating procedural noise maps using either multi-layered smoothed Gaussian noise or the Perlin algorithm. We then employ an enhanced Neural Style transfer technique, drawing style from real-world height maps. This fusion of algorithmic generation and neural processing holds the potential to produce terrains that are not only diverse but also closely aligned with the morphological characteristics of real-world landscapes, with our process yielding consistent terrain structures with low computational cost and offering the capability to create customized maps. Numerical evaluations further validate our model's enhanced ability to accurately replicate terrain morphology, surpassing traditional procedural methods.
- [1302] arXiv:2403.08783 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Image-Text Out-Of-Context Detection Using Synthetic Multimodal MisinformationComments: 8 pages, 2 figures, conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Misinformation has become a major challenge in the era of increasing digital information, requiring the development of effective detection methods. We have investigated a novel approach to Out-Of-Context detection (OOCD) that uses synthetic data generation. We created a dataset specifically designed for OOCD and developed an efficient detector for accurate classification. Our experimental findings validate the use of synthetic data generation and demonstrate its efficacy in addressing the data limitations associated with OOCD. The dataset and detector should serve as valuable resources for future research and the development of robust misinformation detection systems.
- [1303] arXiv:2403.08786 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: One-Spike SNN: Single-Spike Phase Coding with Base Manipulation for ANN-to-SNN Conversion Loss MinimizationComments: 11 pages, 10 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: As spiking neural networks (SNNs) are event-driven, energy efficiency is higher than conventional artificial neural networks (ANNs). Since SNN delivers data through discrete spikes, it is difficult to use gradient methods for training, limiting its accuracy. To keep the accuracy of SNNs similar to ANN counterparts, pre-trained ANNs are converted to SNNs (ANN-to-SNN conversion). During the conversion, encoding activations of ANNs to a set of spikes in SNNs is crucial for minimizing the conversion loss. In this work, we propose a single-spike phase coding as an encoding scheme that minimizes the number of spikes to transfer data between SNN layers. To minimize the encoding error due to single-spike approximation in phase coding, threshold shift and base manipulation are proposed. Without any additional retraining or architectural constraints on ANNs, the proposed conversion method does not lose inference accuracy (0.58% on average) verified on three convolutional neural networks (CNNs) with CIFAR and ImageNet this http URL addition, graph convolutional networks (GCNs) are converted to SNNs successfully with an average accuracy loss of 0.90%.Most importantly, the energy efficiency of our SNN improves by 4.6~17.3 X compared to the ANN baseline.
- [1304] arXiv:2403.08788 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Verification for Object Detection -- IBP IoUNoémie Cohen , Mélanie Ducoffe , Ryma Boumazouza (CRIL), Christophe Gabreau , Claire Pagetti , Xavier Pucel , Audrey GalametzSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
- [1305] arXiv:2403.08789 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Bridging Human Concepts and Computer Vision for Explainable Face VerificationMiriam Doh (UMons, IRIDIA), Caroline Mazini Rodrigues (LRDE, LIGM), Nicolas Boutry (LRDE), Laurent Najman (LIGM), Matei Mancas (UMONS), Hugues Bersini (IRIDIA)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: With Artificial Intelligence (AI) influencing the decision-making process of sensitive applications such as Face Verification, it is fundamental to ensure the transparency, fairness, and accountability of decisions. Although Explainable Artificial Intelligence (XAI) techniques exist to clarify AI decisions, it is equally important to provide interpretability of these decisions to humans. In this paper, we present an approach to combine computer and human vision to increase the explanation's interpretability of a face verification algorithm. In particular, we are inspired by the human perceptual process to understand how machines perceive face's human-semantic areas during face comparison tasks. We use Mediapipe, which provides a segmentation technique that identifies distinct human-semantic facial regions, enabling the machine's perception analysis. Additionally, we adapted two model-agnostic algorithms to provide human-interpretable insights into the decision-making processes.
- [1306] arXiv:2403.08790 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Using Sequential Runtime Distributions for the Parallel Speedup Prediction of SAT Local SearchJournal-ref: Theory and Practice of Logic Programming. 2013;13(4-5):625-639Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a detailed analysis of the scalability and parallelization of local search algorithms for the Satisfiability problem. We propose a framework to estimate the parallel performance of a given algorithm by analyzing the runtime behavior of its sequential version. Indeed, by approximating the runtime distribution of the sequential process with statistical methods, the runtime behavior of the parallel process can be predicted by a model based on order statistics. We apply this approach to study the parallel performance of two SAT local search solvers, namely Sparrow and CCASAT, and compare the predicted performances to the results of an actual experimentation on parallel hardware up to 384 cores. We show that the model is accurate and predicts performance close to the empirical data. Moreover, as we study different types of instances (random and crafted), we observe that the local search solvers exhibit different behaviors and that their runtime distributions can be approximated by two types of distributions: exponential (shifted and non-shifted) and lognormal.
- [1307] arXiv:2403.08797 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Evolutionary Algorithms Simulating Molecular Evolution: A New Field ProposalComments: 7 pages, 2 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: The genetic blueprint for the essential functions of life is encoded in DNA, which is translated into proteins -- the engines driving most of our metabolic processes. Recent advancements in genome sequencing have unveiled a vast diversity of protein families, but compared to the massive search space of all possible amino acid sequences, the set of known functional families is minimal. One could say nature has a limited protein "vocabulary." The major question for computational biologists, therefore, is whether this vocabulary can be expanded to include useful proteins that went extinct long ago, or maybe never evolved in the first place. We outline a computational approach to solving this problem. By merging evolutionary algorithms, machine learning (ML), and bioinformatics, we can facilitate the development of completely novel proteins which have never existed before. We envision this work forming a new sub-field of computational evolution we dub evolutionary algorithms simulating molecular evolution (EASME).
- [1308] arXiv:2403.08807 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Effective anytime algorithm for multiobjective combinatorial optimization problemsJournal-ref: Inf. Sci. 565: 210-228 (2021)Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: In multiobjective optimization, the result of an optimization algorithm is a set of efficient solutions from which the decision maker selects one. It is common that not all the efficient solutions can be computed in a short time and the search algorithm has to be stopped prematurely to analyze the solutions found so far. A set of efficient solutions that are well-spread in the objective space is preferred to provide the decision maker with a great variety of solutions. However, just a few exact algorithms in the literature exist with the ability to provide such a well-spread set of solutions at any moment: we call them anytime algorithms. We propose a new exact anytime algorithm for multiobjective combinatorial optimization combining three novel ideas to enhance the anytime behavior. We compare the proposed algorithm with those in the state-of-the-art for anytime multiobjective combinatorial optimization using a set of 480 instances from different well-known benchmarks and four different performance measures: the overall non-dominated vector generation ratio, the hypervolume, the general spread and the additive epsilon indicator. A comprehensive experimental study reveals that our proposal outperforms the previous algorithms in most of the instances.
- [1309] arXiv:2403.08808 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: A Bionic Data-driven Approach for Long-distance Underwater Navigation with Anomaly ResistanceSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Various animals exhibit accurate navigation using environment cues. The Earth's magnetic field has been proved a reliable information source in long-distance fauna migration. Inspired by animal navigation, this work proposes a bionic and data-driven approach for long-distance underwater navigation. The proposed approach uses measured geomagnetic data for the navigation, and requires no GPS systems or geographical maps. Particularly, we construct and train a Temporal Attention-based Long Short-Term Memory (TA-LSTM) network to predict the heading angle during the navigation. To mitigate the impact of geomagnetic anomalies, we develop the mechanism to detect and quantify the anomalies based on Maximum Likelihood Estimation. We integrate the developed mechanism with the TA-LSTM, and calibrate the predicted heading angles to gain resistance against geomagnetic anomalies. Using the retrieved data from the WMM model, we conduct numerical simulations with diversified navigation conditions to test our approach. The simulation results demonstrate a resilience navigation against geomagnetic anomalies by our approach, along with precision and stability of the underwater navigation in single and multiple destination missions.
- [1310] arXiv:2403.08810 (cross-list from cs.NI) [ pdf , ps , other ]
-
Title: Comparison of edge computing methods in Internet of Things architectures for efficient estimation of indoor environmental parameters with Machine LearningJournal-ref: Engineering Applications of Artificial Intelligence, 2023, vol. 126, Part D, no. 107149, pp. 1-27, ISSN 0952-1976Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract: The large increase in the number of Internet of Things (IoT) devices have revolutionised the way data is processed, which added to the current trend from cloud to edge computing has resulted in the need for efficient and reliable data processing near the data sources using energy-efficient devices. Two methods based on low-cost edge-IoT architectures are proposed to implement lightweight Machine Learning (ML) models that estimate indoor environmental quality (IEQ) parameters, such as Artificial Neural Networks of Multilayer Perceptron type. Their implementation is based on centralised and distributed parallel IoT architectures, connected via wireless, which share commercial off-the-self modules for data acquisition and sensing, such as sensors for temperature, humidity, illuminance, CO2, and other gases. The centralised method uses a Graphics Processing Unit and the Message Queuing Telemetry Transport protocol, but the distributed method utilises low performance ARM-based devices and the Message Passing Interface protocol. Although multiple IEQ parameters are measured, the training and testing of ML models is accomplished with experiments focused on small temperature and illuminance datasets to reduce data processing load, obtained from sudden spikes, square profiles and sawteeth test cases. The results show a high estimation performance with F-score and Accuracy values close to 0.95, and an almost theorical Speedup with a reduction in power consumption close to 37% in the distributed parallel approach. In addition, similar or slightly better performance is achieved compared to equivalent IoT architectures from related research, but error reduction of 35 to 76% is accomplished with an adequate balance between performance and energy efficiency.
- [1311] arXiv:2403.08813 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Federated Deep Q-Learning and 5G load balancingComments: 5 pages, in Chinese language. 8 figures. Presented at 2022 Taiwan telecommunications annual symposiumSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: Despite advances in cellular network technology, base station (BS) load balancing remains a persistent problem. Although centralized resource allocation methods can address the load balancing problem, it still remains an NP-hard problem. In this research, we study how federated deep Q learning can be used to inform each user equipment (UE) of the each BS's load conditions. Federated deep Q learning's load balancing enables intelligent UEs to independently select the best BS while also limiting the amount of private information exposed to the network.
In this study, we propose and analyze a federated deep Q learning load balancing system, which is implemented using the Open-RAN xAPP framework and the near-Real Time Radio Interface Controller (near-RT RIC). Our simulation results indicate that compared to the maximum Signal-To-Noise-Ratio (MAX-SINR) method currently used by UEs, our proposed deep Q learning model can consistently provide better High average UE quality of service - [1312] arXiv:2403.08818 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multimodal Fusion of EHR in Structures and Semantics: Integrating Clinical Records and Notes with Hypergraph and LLMSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Electronic Health Records (EHRs) have become increasingly popular to support clinical decision-making and healthcare in recent decades. EHRs usually contain heterogeneous information, such as structural data in tabular form and unstructured data in textual notes. Different types of information in EHRs can complement each other and provide a more complete picture of the health status of a patient. While there has been a lot of research on representation learning of structured EHR data, the fusion of different types of EHR data (multimodal fusion) is not well studied. This is mostly because of the complex medical coding systems used and the noise and redundancy present in the written notes. In this work, we propose a new framework called MINGLE, which integrates both structures and semantics in EHR effectively. Our framework uses a two-level infusion strategy to combine medical concept semantics and clinical note semantics into hypergraph neural networks, which learn the complex interactions between different types of data to generate visit representations for downstream prediction. Experiment results on two EHR datasets, the public MIMIC-III and private CRADLE, show that MINGLE can effectively improve predictive performance by 11.83% relatively, enhancing semantic integration as well as multimodal fusion for structural and textual EHR data.
- [1313] arXiv:2403.08820 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary PatternsZheyuan Zhang , Zehong Wang , Shifu Hou , Evan Hall , Landon Bachman , Vincent Galassi , Jasmine White , Nitesh V. Chawla , Chuxu Zhang , Yanfang YeSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: The opioid crisis has been one of the most critical society concerns in the United States. Although the medication assisted treatment (MAT) is recognized as the most effective treatment for opioid misuse and addiction, the various side effects can trigger opioid relapse. In addition to MAT, the dietary nutrition intervention has been demonstrated its importance in opioid misuse prevention and recovery. However, research on the alarming connections between dietary patterns and opioid misuse remain under-explored. In response to this gap, in this paper, we first establish a large-scale multifaceted dietary benchmark dataset related to opioid users at the first attempt and then develop a novel framework - i.e., namely Opioid Misuse Detection with Interpretable Dietary Patterns (Diet-ODIN) - to bridge heterogeneous graph (HG) and large language model (LLM) for the identification of users with opioid misuse and the interpretation of their associated dietary patterns. Specifically, in Diet-ODIN, we first construct an HG to comprehensively incorporate both dietary and health-related information, and then we devise a holistic graph learning framework with noise reduction to fully capitalize both users' individual dietary habits and shared dietary patterns for the detection of users with opioid misuse. To further delve into the intricate correlations between dietary patterns and opioid misuse, we exploit an LLM by utilizing the knowledge obtained from the graph learning model for interpretation. The extensive experimental results based on our established benchmark with quantitative and qualitative measures demonstrate the outstanding performance of Diet-ODIN in exploring the complex interplay between opioid misuse and dietary patterns, by comparison with state-of-the-art baseline methods.
- [1314] arXiv:2403.08824 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Measuring Non-Typical Emotions for Mental Health: A Survey of Computational ApproachesComments: Under review in IEEE Transactions on Affective ComputingSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Analysis of non-typical emotions, such as stress, depression and engagement is less common and more complex compared to that of frequently discussed emotions like happiness, sadness, fear, and anger. The importance of these non-typical emotions has been increasingly recognized due to their implications on mental health and well-being. Stress and depression impact the engagement in daily tasks, highlighting the need to understand their interplay. This survey is the first to simultaneously explore computational methods for analyzing stress, depression, and engagement. We discuss the most commonly used datasets, input modalities, data processing techniques, and information fusion methods used for the computational analysis of stress, depression and engagement. A timeline and taxonomy of non-typical emotion analysis approaches along with their generic pipeline and categories are presented. Subsequently, we describe state-of-the-art computational approaches for non-typical emotion analysis, including a performance summary on the most commonly used datasets. Following this, we explore the applications, along with the associated challenges, limitations, and future research directions.
- [1315] arXiv:2403.08828 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: People Attribute Purpose to Autonomous Vehicles When Explaining Their BehaviorBalint Gyevnar , Stephanie Droop , Tadeg Quillien , Shay B. Cohen , Neil R. Bramley , Christopher G. Lucas , Stefano V. AlbrechtSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Cognitive science can help us understand which explanations people might expect, and in which format they frame these explanations, whether causal, counterfactual, or teleological (i.e., purpose-oriented). Understanding the relevance of these concepts is crucial for building good explainable AI (XAI) which offers recourse and actionability. Focusing on autonomous driving, a complex decision-making domain, we report empirical data from two surveys on (i) how people explain the behavior of autonomous vehicles in 14 unique scenarios (N1=54), and (ii) how they perceive these explanations in terms of complexity, quality, and trustworthiness (N2=356). Participants deemed teleological explanations significantly better quality than counterfactual ones, with perceived teleology being the best predictor of perceived quality and trustworthiness. Neither the perceived teleology nor the quality were affected by whether the car was an autonomous vehicle or driven by a person. This indicates that people use teleology to evaluate information about not just other people but also autonomous vehicles. Taken together, our findings highlight the importance of explanations that are framed in terms of purpose rather than just, as is standard in XAI, the causal mechanisms involved. We release the 14 scenarios and more than 1,300 elicited explanations publicly as the Human Explanations for Autonomous Driving Decisions (HEADD) dataset.
- [1316] arXiv:2403.08833 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language NavigationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Zero-shot navigation is a critical challenge in Vision-Language Navigation (VLN) tasks, where the ability to adapt to unfamiliar instructions and to act in unknown environments is essential. Existing supervised learning-based models, trained using annotated data through reinforcement learning, exhibit limitations in generalization capabilities. Large Language Models (LLMs), with their extensive knowledge and emergent reasoning abilities, present a potential pathway for achieving zero-shot navigation. This paper presents a VLN agent based on LLMs, exploring approaches to the zero-shot navigation problem. To compensate for the shortcomings of LLMs in environmental perception, we propose the Thinking, Interacting, and Action (TINA) framework. TINA enables the agent to scrutinize perceptual information and autonomously query key clues within the environment through an introduced question-answering module, thereby aligning instructions with specific perceptual data. The navigation agent's perceptual abilities are enhanced through the TINA framework, while the explicit thought and query processes also improve the navigational procedure's explainability and transparency. We evaluate the performance of our method on the Room-to-Room dataset. The experiment results indicate that our approach improves the navigation performance of LLM-based agents. Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
- [1317] arXiv:2403.08834 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Predictive Analysis of Tuberculosis Treatment Outcomes Using Machine Learning: A Karnataka TB Data Study at a ScaleSeshaSai Nath Chinagudaba , Darshan Gera , Krishna Kiran Vamsi Dasu , Uma Shankar S , Kiran K , Anil Singarajpure , Shivayogappa.U , Somashekar N , Vineet Kumar Chadda , Sharath B NSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Tuberculosis (TB) remains a global health threat, ranking among the leading causes of mortality worldwide. In this context, machine learning (ML) has emerged as a transformative force, providing innovative solutions to the complexities associated with TB treatment.This study explores how machine learning, especially with tabular data, can be used to predict Tuberculosis (TB) treatment outcomes more accurately. It transforms this prediction task into a binary classification problem, generating risk scores from patient data sourced from NIKSHAY, India's national TB control program, which includes over 500,000 patient records.
Data preprocessing is a critical component of the study, and the model achieved an recall of 98% and an AUC-ROC score of 0.95 on the validation set, which includes 20,000 patient records.We also explore the use of Natural Language Processing (NLP) for improved model learning. Our results, corroborated by various metrics and ablation studies, validate the effectiveness of our approach. The study concludes by discussing the potential ramifications of our research on TB eradication efforts and proposing potential avenues for future work. This study marks a significant stride in the battle against TB, showcasing the potential of machine learning in healthcare. - [1318] arXiv:2403.08835 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Stacking-based deep neural network for player scouting in football 1Simon Lacan (IMT Nord Europe)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Datascouting is one of the most known data applications in professional sport, and specifically football. Its objective is to analyze huge database of players in order to detect high potentials that can be then individually considered by human scouts. In this paper, we propose a stacking-based deep learning model to detect high potential football players. Applied on open-source database, our model obtains significantly better results that classical statistical methods.
- [1319] arXiv:2403.08836 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Structural Positional Encoding for knowledge integration in transformer-based medical process monitoringSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Predictive process monitoring is a process mining task aimed at forecasting information about a running process trace, such as the most correct next activity to be executed. In medical domains, predictive process monitoring can provide valuable decision support in atypical and nontrivial situations. Decision support and quality assessment in medicine cannot ignore domain knowledge, in order to be grounded on all the available information (which is not limited to data) and to be really acceptable by end users.
In this paper, we propose a predictive process monitoring approach relying on the use of a {\em transformer}, a deep learning architecture based on the attention mechanism. A major contribution of our work lies in the incorporation of ontological domain-specific knowledge, carried out through a graph positional encoding technique. The paper presents and discusses the encouraging experimental result we are collecting in the domain of stroke management. - [1320] arXiv:2403.08837 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cyclic Data Parallelism for Efficient Parallelism of Deep Neural NetworksLouis Fournier (MLIA), Edouard OyallonSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Abstract: Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory required to store the model's activations peaks at the end of the forward pass, and gradients must be simultaneously averaged at the end of the backpropagation step. We propose Cyclic Data Parallelism, a novel paradigm shifting the execution of the micro-batches from simultaneous to sequential, with a uniform delay. At the cost of a slight gradient delay, the total memory taken by activations is constant, and the gradient communications are balanced during the training step. With Model Parallelism, our technique reduces the number of GPUs needed, by sharing GPUs across micro-batches. Within the ZeRO-DP framework, our technique allows communication of the model states with point-to-point operations rather than a collective broadcast operation. We illustrate the strength of our approach on the CIFAR-10 and ImageNet datasets.
- [1321] arXiv:2403.08838 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Predictive Clustering of Vessel Behavior Based on Hierarchical Trajectory RepresentationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Vessel trajectory clustering, which aims to find similar trajectory patterns, has been widely leveraged in overwater applications. Most traditional methods use predefined rules and thresholds to identify discrete vessel behaviors. They aim for high-quality clustering and conduct clustering on entire sequences, whether the original trajectory or its sub-trajectories, failing to represent their evolution. To resolve this problem, we propose a Predictive Clustering of Hierarchical Vessel Behavior (PC-HiV). PC-HiV first uses hierarchical representations to transform every trajectory into a behavioral sequence. Then, it predicts evolution at each timestamp of the sequence based on the representations. By applying predictive clustering and latent encoding, PC-HiV improves clustering and predictions simultaneously. Experiments on real AIS datasets demonstrate PC-HiV's superiority over existing methods, showcasing its effectiveness in capturing behavioral evolution discrepancies between vessel types (tramp vs. liner) and within emission control areas. Results show that our method outperforms NN-Kmeans and Robust DAA by 3.9% and 6.4% of the purity score.
- [1322] arXiv:2403.08840 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear InterpolationComments: ICLR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image interpolation based on diffusion models is promising in creating fresh and interesting images. Advanced interpolation methods mainly focus on spherical linear interpolation, where images are encoded into the noise space and then interpolated for denoising to images. However, existing methods face challenges in effectively interpolating natural images (not generated by diffusion models), thereby restricting their practical applicability. Our experimental investigations reveal that these challenges stem from the invalidity of the encoding noise, which may no longer obey the expected noise distribution, e.g., a normal distribution. To address these challenges, we propose a novel approach to correct noise for image interpolation, NoiseDiffusion. Specifically, NoiseDiffusion approaches the invalid noise to the expected distribution by introducing subtle Gaussian noise and introduces a constraint to suppress noise with extreme values. In this context, promoting noise validity contributes to mitigating image artifacts, but the constraint and introduced exogenous noise typically lead to a reduction in signal-to-noise ratio, i.e., loss of original image information. Hence, NoiseDiffusion performs interpolation within the noisy image space and injects raw images into these noisy counterparts to address the challenge of information loss. Consequently, NoiseDiffusion enables us to interpolate natural images without causing artifacts or information loss, thus achieving the best interpolation results.
- [1323] arXiv:2403.08844 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: AcademiaOS: Automating Grounded Theory Development in Qualitative Research with Large Language ModelsSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: AcademiaOS is a first attempt to automate grounded theory development in qualitative research with large language models. Using recent large language models' language understanding, generation, and reasoning capabilities, AcademiaOS codes curated qualitative raw data such as interview transcripts and develops themes and dimensions to further develop a grounded theoretical model, affording novel insights. A user study (n=19) suggests that the system finds acceptance in the academic community and exhibits the potential to augment humans in qualitative research. AcademiaOS has been made open-source for others to build upon and adapt to their use cases.
- [1324] arXiv:2403.08845 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Bifurcated Attention for Single-Context Large-Batch SamplingBen Athiwaratkun , Sujan Kumar Gonugondla , Sanjay Krishna Gouda , Haifeng Qian , Hantian Ding , Qing Sun , Jun Wang , Jiacheng Guo , Liangfu Chen , Parminder Bhatia , Ramesh Nallapati , Sudipta Sengupta , Bing XiangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In our study, we present bifurcated attention, a method developed for language model inference in single-context batch sampling contexts. This approach aims to reduce redundant memory IO costs, a significant factor in latency for high batch sizes and long context lengths. Bifurcated attention achieves this by dividing the attention mechanism during incremental decoding into two distinct GEMM operations, focusing on the KV cache from prefill and the decoding process. This method ensures precise computation and maintains the usual computational load (FLOPs) of standard attention mechanisms, but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention mechanism known for reduced memory IO for KV cache, further enabling higher batch size and context length. The resulting efficiency leads to lower latency, improving suitability for real-time applications, e.g., enabling massively-parallel answer generation without substantially increasing latency, enhancing performance when integrated with postprocessing techniques such as reranking.
- [1325] arXiv:2403.08879 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multi-Objective Optimization Using Adaptive Distributed Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Abstract: The Intelligent Transportation System (ITS) environment is known to be dynamic and distributed, where participants (vehicle users, operators, etc.) have multiple, changing and possibly conflicting objectives. Although Reinforcement Learning (RL) algorithms are commonly applied to optimize ITS applications such as resource management and offloading, most RL algorithms focus on single objectives. In many situations, converting a multi-objective problem into a single-objective one is impossible, intractable or insufficient, making such RL algorithms inapplicable. We propose a multi-objective, multi-agent reinforcement learning (MARL) algorithm with high learning efficiency and low computational requirements, which automatically triggers adaptive few-shot learning in a dynamic, distributed and noisy environment with sparse and delayed reward. We test our algorithm in an ITS environment with edge cloud computing. Empirical results show that the algorithm is quick to adapt to new environments and performs better in all individual and system metrics compared to the state-of-the-art benchmark. Our algorithm also addresses various practical concerns with its modularized and asynchronous online training method. In addition to the cloud simulation, we test our algorithm on a single-board computer and show that it can make inference in 6 milliseconds.
- [1326] arXiv:2403.08882 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Cultural evolution in populations of Large Language ModelsJérémy Perez , Corentin Léger , Marcela Ovando-Tellez , Chris Foulon , Joan Dussauld , Pierre-Yves Oudeyer , Clément Moulin-FrierComments: 17 pages, 20 figures. Open-source code available at this https URLSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Populations and Evolution (q-bio.PE)
Abstract: Research in cultural evolution aims at providing causal explanations for the change of culture over time. Over the past decades, this field has generated an important body of knowledge, using experimental, historical, and computational methods. While computational models have been very successful at generating testable hypotheses about the effects of several factors, such as population structure or transmission biases, some phenomena have so far been more complex to capture using agent-based and formal models. This is in particular the case for the effect of the transformations of social information induced by evolved cognitive mechanisms. We here propose that leveraging the capacity of Large Language Models (LLMs) to mimic human behavior may be fruitful to address this gap. On top of being an useful approximation of human cultural dynamics, multi-agents models featuring generative agents are also important to study for their own sake. Indeed, as artificial agents are bound to participate more and more to the evolution of culture, it is crucial to better understand the dynamics of machine-generated cultural evolution. We here present a framework for simulating cultural evolution in populations of LLMs, allowing the manipulation of variables known to be important in cultural evolution, such as network structure, personality, and the way social information is aggregated and transformed. The software we developed for conducting these simulations is open-source and features an intuitive user-interface, which we hope will help to build bridges between the fields of cultural evolution and generative artificial intelligence.
- [1327] arXiv:2403.08885 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SLCF-Net: Sequential LiDAR-Camera Fusion for Semantic Scene Completion using a 3D Recurrent U-NetComments: 2024 IEEE International Conference on Robotics and Automation (ICRA2024), Yokohama, Japan, May 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: We introduce SLCF-Net, a novel approach for the Semantic Scene Completion (SSC) task that sequentially fuses LiDAR and camera data. It jointly estimates missing geometry and semantics in a scene from sequences of RGB images and sparse LiDAR measurements. The images are semantically segmented by a pre-trained 2D U-Net and a dense depth prior is estimated from a depth-conditioned pipeline fueled by Depth Anything. To associate the 2D image features with the 3D scene volume, we introduce Gaussian-decay Depth-prior Projection (GDP). This module projects the 2D features into the 3D volume along the line of sight with a Gaussian-decay function, centered around the depth prior. Volumetric semantics is computed by a 3D U-Net. We propagate the hidden 3D U-Net state using the sensor motion and design a novel loss to ensure temporal consistency. We evaluate our approach on the SemanticKITTI dataset and compare it with leading SSC approaches. The SLCF-Net excels in all SSC metrics and shows great temporal consistency.
- [1328] arXiv:2403.08906 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: Strategizing against Q-learners: A Control-theoretical ApproachSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: In this paper, we explore the susceptibility of the Q-learning algorithm (a classical and widely used reinforcement learning method) to strategic manipulation of sophisticated opponents in games. We quantify how much a strategically sophisticated agent can exploit a naive Q-learner if she knows the opponent's Q-learning algorithm. To this end, we formulate the strategic actor's problem as a Markov decision process (with a continuum state space encompassing all possible Q-values) as if the Q-learning algorithm is the underlying dynamical system. We also present a quantization-based approximation scheme to tackle the continuum state space and analyze its performance both analytically and numerically.
- [1329] arXiv:2403.08915 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Cross-Modal Learning of Housing Quality in AmsterdamComments: Presented at SIGSpatial GeoAI workshop '21Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In our research we test data and models for the recognition of housing quality in the city of Amsterdam from ground-level and aerial imagery. For ground-level images we compare Google StreetView (GSV) to Flickr images. Our results show that GSV predicts the most accurate building quality scores, approximately 30% better than using only aerial images. However, we find that through careful filtering and by using the right pre-trained model, Flickr image features combined with aerial image features are able to halve the performance gap to GSV features from 30% to 15%. Our results indicate that there are viable alternatives to GSV for liveability factor prediction, which is encouraging as GSV images are more difficult to acquire and not always available.
- [1330] arXiv:2403.08933 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Unveiling the Truth: Exploring Human Gaze Patterns in Fake ImagesComments: Accepted to IEEE Signal Processing Letters 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we leverage human semantic knowledge to investigate the possibility of being included in frameworks of fake image detection. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: this https URL .
- [1331] arXiv:2403.08936 (cross-list from cs.MA) [ pdf , ps , html , other ]
-
Title: Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement LearningPeihong Yu , Manav Mishra , Alec Koppel , Carl Busart , Priya Narayan , Dinesh Manocha , Amrit Bedi , Pratap TokekarSubjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements, thus naively imitating them will not achieve cooperation due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of policy behavior with demonstrations, and the second regulates incentives based on whether the behavior leads to the desired objective. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The results demonstrate that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations, and outperforms state-of-the-art MARL algorithms in solving coordinated tasks. We also showcase PegMARL's capability to leverage joint demonstrations in the StarCraft scenario and converge effectively even with demonstrations from non-co-trained policies.
- [1332] arXiv:2403.08937 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Bugs in Large Language Models Generated Code: An Empirical StudyFlorian Tambon , Arghavan Moradi Dakhel , Amin Nikanjam , Foutse Khomh , Michel C. Desmarais , Giuliano AntoniolComments: 47 pages, 7 figuresSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) for code have gained significant attention recently. They can generate code in different programming languages based on provided prompts, fulfilling a long-lasting dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are presented in the form of a taxonomy. The identified bug patterns are validated using an online survey with 34 LLM practitioners and researchers. The surveyed participants generally asserted the significance and prevalence of the bug patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.
- [1333] arXiv:2403.08944 (cross-list from cs.GT) [ pdf , ps , other ]
-
Title: Language-based game theory in the age of artificial intelligenceJournal-ref: Journal of the Royal Society Interface 21, 20230720 (2024)Subjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Theoretical Economics (econ.TH)
Abstract: Understanding human behaviour in decision problems and strategic interactions has wide-ranging applications in economics, psychology, and artificial intelligence. Game theory offers a robust foundation for this understanding, based on the idea that individuals aim to maximize a utility function. However, the exact factors influencing strategy choices remain elusive. While traditional models try to explain human behaviour as a function of the outcomes of available actions, recent experimental research reveals that linguistic content significantly impacts decision-making, thus prompting a paradigm shift from outcome-based to language-based utility functions. This shift is more urgent than ever, given the advancement of generative AI, which has the potential to support humans in making critical decisions through language-based interactions. We propose sentiment analysis as a fundamental tool for this shift and take an initial step by analyzing 61 experimental instructions from the dictator game, an economic game capturing the balance between self-interest and the interest of others, which is at the core of many social interactions. Our meta-analysis shows that sentiment analysis can explain human behaviour beyond economic outcomes. We discuss future research directions. We hope this work sets the stage for a novel game theoretical approach that emphasizes the importance of language in human decisions.
- [1334] arXiv:2403.08950 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Exploring Prompt Engineering Practices in the EnterpriseSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Interaction with Large Language Models (LLMs) is primarily carried out via prompting. A prompt is a natural language instruction designed to elicit certain behaviour or output from a model. In theory, natural language prompts enable non-experts to interact with and leverage LLMs. However, for complex tasks and tasks with specific requirements, prompt design is not trivial. Creating effective prompts requires skill and knowledge, as well as significant iteration in order to determine model behavior, and guide the model to accomplish a particular goal. We hypothesize that the way in which users iterate on their prompts can provide insight into how they think prompting and models work, as well as the kinds of support needed for more efficient prompt engineering. To better understand prompt engineering practices, we analyzed sessions of prompt editing behavior, categorizing the parts of prompts users iterated on and the types of changes they made. We discuss design implications and future directions based on these prompt engineering practices.
- [1335] arXiv:2403.08955 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity AnalysisSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Reinforcement Learning (RL) has shown exceptional performance across various applications, enabling autonomous agents to learn optimal policies through interaction with their environments. However, traditional RL frameworks often face challenges in terms of iteration complexity and robustness. Risk-sensitive RL, which balances expected return and risk, has been explored for its potential to yield probabilistically robust policies, yet its iteration complexity analysis remains underexplored. In this study, we conduct a thorough iteration complexity analysis for the risk-sensitive policy gradient method, focusing on the REINFORCE algorithm and employing the exponential utility function. We obtain an iteration complexity of $\mathcal{O}(\epsilon^{-2})$ to reach an $\epsilon$-approximate first-order stationary point (FOSP). We investigate whether risk-sensitive algorithms can achieve better iteration complexity compared to their risk-neutral counterparts. Our theoretical analysis demonstrates that risk-sensitive REINFORCE can have a reduced number of iterations required for convergence. This leads to improved iteration complexity, as employing the exponential utility does not entail additional computation per iteration. We characterize the conditions under which risk-sensitive algorithms can achieve better iteration complexity. Our simulation results also validate that risk-averse cases can converge and stabilize more quickly after approximately half of the episodes compared to their risk-neutral counterparts.
- [1336] arXiv:2403.08956 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: AI coach for badmintonComments: 7 pages, 11 figures. this https URLJournal-ref: 2022 3rd International Conference for Emerging Technology (INCET), Belgaum, India, 2022, pp. 1-7Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: In the competitive realm of sports, optimal performance necessitates rigorous management of nutrition and physical conditioning. Specifically, in badminton, the agility and precision required make it an ideal candidate for motion analysis through video analytics. This study leverages advanced neural network methodologies to dissect video footage of badminton matches, aiming to extract detailed insights into player kinetics and biomechanics. Through the analysis of stroke mechanics, including hand-hip coordination, leg positioning, and the execution angles of strokes, the research aims to derive predictive models that can suggest improvements in stance, technique, and muscle orientation. These recommendations are designed to mitigate erroneous techniques, reduce the risk of joint fatigue, and enhance overall performance. Utilizing a vast array of data available online, this research correlates players' physical attributes with their in-game movements to identify muscle activation patterns during play. The goal is to offer personalized training and nutrition strategies that align with the specific biomechanical demands of badminton, thereby facilitating targeted performance enhancements.
- [1337] arXiv:2403.08962 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Using Deep Learning for Morphological Classification in Pigs with a Focus on Sanitary MonitoringEduardo Bedin , Junior Silva Souza , Gabriel Toshio Hirokawa Higa , Alexandre Pereira , Charles Kiefer , Newton Loebens , Hemerson PistoriSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The aim of this paper is to evaluate the use of D-CNN (Deep Convolutional Neural Networks) algorithms to classify pig body conditions in normal or not normal conditions, with a focus on characteristics that are observed in sanitary monitoring, and were used six different algorithms to do this task. The study focused on five pig characteristics, being these caudophagy, ear hematoma, scratches on the body, redness, and natural stains (brown or black). The results of the study showed that D-CNN was effective in classifying deviations in pig body morphologies related to skin characteristics. The evaluation was conducted by analyzing the performance metrics Precision, Recall, and F-score, as well as the statistical analyses ANOVA and the Scott-Knott test. The contribution of this article is characterized by the proposal of using D-CNN networks for morphological classification in pigs, with a focus on characteristics identified in sanitary monitoring. Among the best results, the average Precision metric of 80.6\% to classify caudophagy was achieved for the InceptionResNetV2 network, indicating the potential use of this technology for the proposed task. Additionally, a new image database was created, containing various pig's distinct body characteristics, which can serve as data for future research.
- [1338] arXiv:2403.08967 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and CaptioningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) Gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among the patches demand more attention; and 2) Authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate each patch feature with MIL method that considers the correlations among instances. Furthermore, our PathM3 overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data in the manner of multi-task joint learning. Extensive experiments with improved classification accuracy and caption generation demonstrate the effectiveness of our method on both WSI classification and captioning task.
- [1339] arXiv:2403.08974 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Representing Anatomical Trees by Denoising Diffusion of Implicit Neural FieldsComments: Preprint. In review. Code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Anatomical trees play a central role in clinical diagnosis and treatment planning. However, accurately representing anatomical trees is challenging due to their varying and complex topology and geometry. Traditional methods for representing tree structures, captured using medical imaging, while invaluable for visualizing vascular and bronchial networks, exhibit drawbacks in terms of limited resolution, flexibility, and efficiency. Recently, implicit neural representations (INRs) have emerged as a powerful tool for representing shapes accurately and efficiently. We propose a novel approach for representing anatomical trees using INR, while also capturing the distribution of a set of trees via denoising diffusion in the space of INRs. We accurately capture the intricate geometries and topologies of anatomical trees at any desired resolution. Through extensive qualitative and quantitative evaluation, we demonstrate high-fidelity tree reconstruction with arbitrary resolution yet compact storage, and versatility across anatomical sites and tree complexities.
- [1340] arXiv:2403.08984 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Safe Road-Crossing by Autonomous Wheelchairs: a Novel Dataset and its Experimental EvaluationComments: 14 pages, 8 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Safe road-crossing by self-driving vehicles is a crucial problem to address in smart-cities. In this paper, we introduce a multi-sensor fusion approach to support road-crossing decisions in a system composed by an autonomous wheelchair and a flying drone featuring a robust sensory system made of diverse and redundant components. To that aim, we designed an analytical danger function based on explainable physical conditions evaluated by single sensors, including those using machine learning and artificial vision. As a proof-of-concept, we provide an experimental evaluation in a laboratory environment, showing the advantages of using multiple sensors, which can improve decision accuracy and effectively support safety assessment. We made the dataset available to the scientific community for further experimentation. The work has been developed in the context of an European project named REXASI-PRO, which aims to develop trustworthy artificial intelligence for social navigation of people with reduced mobility.
- [1341] arXiv:2403.09024 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Semiparametric Token-Sequence Co-SupervisionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains a language model by simultaneously leveraging supervision from the traditional next token prediction loss which is calculated over the parametric token embedding space and the next sequence prediction loss which is calculated over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate language model tasked to condense an input text into a single representative embedding. Our experiments demonstrate that a model trained via both supervisions consistently surpasses models trained via each supervision independently. Analysis suggests that this co-supervision encourages a broader generalization capability across the model. Especially, the robustness of parametric token space which is established during the pretraining step tends to effectively enhance the stability of nonparametric sequence embedding space, a new space established by another language model.
- [1342] arXiv:2403.09029 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight DatasetSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.
- [1343] arXiv:2403.09039 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Spatial-temporal Memories Enhanced Graph Autoencoder for Anomaly Detection in Dynamic GraphsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Anomaly detection in dynamic graphs presents a significant challenge due to the temporal evolution of graph structures and attributes. The conventional approaches that tackle this problem typically employ an unsupervised learning framework, capturing normality patterns with exclusive normal data during training and identifying deviations as anomalies during testing. However, these methods face critical drawbacks: they either only depend on proxy tasks for general representation without directly pinpointing normal patterns, or they neglect to differentiate between spatial and temporal normality patterns, leading to diminished efficacy in anomaly detection. To address these challenges, we introduce a novel Spatial-Temporal memories-enhanced graph autoencoder (STRIPE). Initially, STRIPE employs Graph Neural Networks (GNNs) and gated temporal convolution layers to extract spatial features and temporal features, respectively. Then STRIPE incorporates separate spatial and temporal memory networks, which capture and store prototypes of normal patterns, thereby preserving the uniqueness of spatial and temporal normality. After that, through a mutual attention mechanism, these stored patterns are then retrieved and integrated with encoded graph embeddings. Finally, the integrated features are fed into the decoder to reconstruct the graph streams which serve as the proxy task for anomaly detection. This comprehensive approach not only minimizes reconstruction errors but also refines the model by emphasizing the compactness and distinctiveness of the embeddings in relation to the nearest memory prototypes. Through extensive testing, STRIPE has demonstrated a superior capability to discern anomalies by effectively leveraging the distinct spatial and temporal dynamics of dynamic graphs, significantly outperforming existing methodologies, with an average improvement of 15.39% on AUC values.
- [1344] arXiv:2403.09053 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Towards a theory of model distillationComments: 46 pages, 5 figures. Please reach out with comments! Feedback is welcomeSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open.
To study these questions, we initiate a general theory of distillation, defining PAC-distillation in an analogous way to PAC-learning [Val84]. As applications of this theory: (1) we propose new algorithms to extract the knowledge stored in the trained weights of neural networks -- we show how to efficiently distill neural networks into succinct, explicit decision tree representations when possible by using the ``linear representation hypothesis''; and (2) we prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity. - [1345] arXiv:2403.09054 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative InferenceMuhammad Adnan , Akhil Arunkumar , Gaurav Jain , Prashant J. Nair , Ilya Soloveychik , Purushotham KamathJournal-ref: Proceedings of the 7th Annual Conference on Machine Learning and Systems (MLSys), 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs.
This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy. - [1346] arXiv:2403.09057 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Continued Pretrained LLM Approach for Automatic Medical Note GenerationDong Yuan , Eti Rastogi , Gautam Naik , Sree Prasanna Rajagopal , Sagar Goyal , Fen Zhao , Bharath Chintagunta , Jeff WardComments: Accepted to NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: LLMs are revolutionizing NLP tasks. However, the use of the most advanced LLMs, such as GPT-4, is often prohibitively expensive for most specialized fields. We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing. Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an accuracy of 78.4\%. It also achieves parity with GPT-4 in generating medical notes. Remarkably, HEAL surpasses GPT-4 and Med-PaLM 2 in identifying more correct medical concepts and exceeds the performance of human scribes and other comparable models in correctness and completeness.
- [1347] arXiv:2403.09063 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Distribution and Depth-Aware Transformers for 3D Human Mesh RecoveryComments: Submitted to 21st International Conference on Robots and Vision (CRV'24), Guelph, Ontario, CanadaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Precise Human Mesh Recovery (HMR) with in-the-wild data is a formidable challenge and is often hindered by depth ambiguities and reduced precision. Existing works resort to either pose priors or multi-modal data such as multi-view or point cloud information, though their methods often overlook the valuable scene-depth information inherently present in a single image. Moreover, achieving robust HMR for out-of-distribution (OOD) data is exceedingly challenging due to inherent variations in pose, shape and depth. Consequently, understanding the underlying distribution becomes a vital subproblem in modeling human forms. Motivated by the need for unambiguous and robust human modeling, we introduce Distribution and depth-aware human mesh recovery (D2A-HMR), an end-to-end transformer architecture meticulously designed to minimize the disparity between distributions and incorporate scene-depth leveraging prior depth information. Our approach demonstrates superior performance in handling OOD data in certain scenarios while consistently achieving competitive results against state-of-the-art HMR methods on controlled datasets.
- [1348] arXiv:2403.09072 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: UniCode: Learning a Unified Codebook for Multimodal Large Language ModelsComments: 14 pages, 2 figures, 11 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: In this paper, we propose \textbf{UniCode}, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts MLLM's ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term ``image decompression'', enabling our model to interpret compressed visual data and generate high-quality images.The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, Unicode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performances comparable to leading MLLMs across a spectrum of VQA benchmarks.
- [1349] arXiv:2403.09085 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Meaningful Learning: Advancing Abstract Reasoning in Large Language Models via Generic Fact GuidanceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have developed impressive performance and strong explainability across various reasoning scenarios, marking a significant stride towards mimicking human-like intelligence. Despite this, when tasked with simple questions supported by a generic fact, LLMs often fail to provide consistent and precise answers, indicating a deficiency in abstract reasoning abilities. This has sparked a vigorous debate about whether LLMs are genuinely reasoning or merely memorizing. In light of this, we design a preliminary study to quantify and delve into the abstract reasoning abilities of existing LLMs. Our findings reveal a substantial discrepancy between their general reasoning and abstract reasoning performances. To relieve this problem, we tailor an abstract reasoning dataset (AbsR) together with a meaningful learning paradigm to teach LLMs how to leverage generic facts for reasoning purposes. The results show that our approach not only boosts the general reasoning performance of LLMs but also makes considerable strides towards their capacity for abstract reasoning, moving beyond simple memorization or imitation to a more nuanced understanding and application of generic facts.
- [1350] arXiv:2403.09092 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News DetectionComments: Accepted by the ACM Web Conference 2024 (WWW 2024) oral, dataset available: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.
- [1351] arXiv:2403.09113 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pretrained weights, has proven particularly effective. Nonetheless, LoRA's uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA.
- [1352] arXiv:2403.09131 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ProSwitch: Knowledge-Guided Instruction Tuning to Generate Professional and Non-Professional Styled TextComments: 8 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including text summarization and controlled text generation. However, studies into their capacity of switching between styles via fine-tuning remain underexplored. This study concentrates on textual professionalism and introduces a novel methodology, named ProSwitch, which equips a language model with the ability to produce both professional and non-professional responses through knowledge-guided instruction tuning. ProSwitch unfolds across three phases: data preparation for gathering domain knowledge and training corpus; instruction tuning for optimizing language models with multiple levels of instruction formats; and comprehensive evaluation for assessing the professionalism discrimination and reference-based quality of generated text. Comparative analysis of ProSwitch against both general and specialized language models reveals that our approach outperforms baselines in switching between professional and non-professional text generation.
- [1353] arXiv:2403.09141 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Uncertainty Estimation in Multi-Agent Distributed Learning for AI-Enabled Edge DevicesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Initially considered as low-power units with limited autonomous processing, Edge IoT devices have seen a paradigm shift with the introduction of FPGAs and AI accelerators. This advancement has vastly amplified their computational capabilities, emphasizing the practicality of edge AI. Such progress introduces new challenges of optimizing AI tasks for the limitations of energy and network resources typical in Edge computing environments. Our study explores methods that enable distributed data processing through AI-enabled edge devices, enhancing collaborative learning capabilities. A key focus of our research is the challenge of determining confidence levels in learning outcomes, considering the spatial and temporal variability of data sets encountered by independent agents. To address this issue, we investigate the application of Bayesian neural networks, proposing a novel approach to manage uncertainty in distributed learning environments.
- [1354] arXiv:2403.09142 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: USimAgent: Large Language Models for Simulating Search UsersSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Due to the advantages in the cost-efficiency and reproducibility, user simulation has become a promising solution to the user-centric evaluation of information retrieval systems. Nonetheless, accurately simulating user search behaviors has long been a challenge, because users' actions in search are highly complex and driven by intricate cognitive processes such as learning, reasoning, and planning. Recently, Large Language Models (LLMs) have demonstrated remarked potential in simulating human-level intelligence and have been used in building autonomous agents for various tasks. However, the potential of using LLMs in simulating search behaviors has not yet been fully explored. In this paper, we introduce a LLM-based user search behavior simulator, USimAgent. The proposed simulator can simulate users' querying, clicking, and stopping behaviors during search, and thus, is capable of generating complete search sessions for specific search tasks. Empirical investigation on a real user behavior dataset shows that the proposed simulator outperforms existing methods in query generation and is comparable to traditional methods in predicting user clicks and stopping behaviors. These results not only validate the effectiveness of using LLMs for user simulation but also shed light on the development of a more robust and generic user simulators.
- [1355] arXiv:2403.09171 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ADEdgeDrop: Adversarial Edge Dropping for Robust Graph Neural NetworksZhaoliang Chen , Zhihao Wu , Ylli Sadikaj , Claudia Plant , Hong-Ning Dai , Shiping Wang , Wenzhong GuoSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Although Graph Neural Networks (GNNs) have exhibited the powerful ability to gather graph-structured information from neighborhood nodes via various message-passing mechanisms, the performance of GNNs is limited by poor generalization and fragile robustness caused by noisy and redundant graph data. As a prominent solution, Graph Augmentation Learning (GAL) has recently received increasing attention. Among prior GAL approaches, edge-dropping methods that randomly remove edges from a graph during training are effective techniques to improve the robustness of GNNs. However, randomly dropping edges often results in bypassing critical edges, consequently weakening the effectiveness of message passing. In this paper, we propose a novel adversarial edge-dropping method (ADEdgeDrop) that leverages an adversarial edge predictor guiding the removal of edges, which can be flexibly incorporated into diverse GNN backbones. Employing an adversarial training framework, the edge predictor utilizes the line graph transformed from the original graph to estimate the edges to be dropped, which improves the interpretability of the edge-dropping method. The proposed ADEdgeDrop is optimized alternately by stochastic gradient descent and projected gradient descent. Comprehensive experiments on six graph benchmark datasets demonstrate that the proposed ADEdgeDrop outperforms state-of-the-art baselines across various GNN backbones, demonstrating improved generalization and robustness.
- [1356] arXiv:2403.09184 (cross-list from eess.SY) [ pdf , ps , other ]
-
Title: Learning Algorithms for Verification of Markov Decision ProcessesTomáš Brázdil , Krishnendu Chatterjee , Martin Chmelik , Vojtěch Forejt , Jan Křetínský , Marta Kwiatkowska , Tobias Meggendorfer , David Parker , Mateusz UjmaSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Abstract: We present a general framework for applying learning algorithms and heuristical guidance to the verification of Markov decision processes (MDPs). The primary goal of our techniques is to improve performance by avoiding an exhaustive exploration of the state space, instead focussing on particularly relevant areas of the system, guided by heuristics. Our work builds on the previous results of Br{á}zdil et al., significantly extending it as well as refining several details and fixing errors.
The presented framework focuses on probabilistic reachability, which is a core problem in verification, and is instantiated in two distinct scenarios. The first assumes that full knowledge of the MDP is available, in particular precise transition probabilities. It performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP without knowing the exact transition dynamics. Here, we obtain probabilistic guarantees, again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. In particular, the latter is an extension of statistical model-checking (SMC) for unbounded properties in MDPs. In contrast to other related approaches, we do not restrict our attention to time-bounded (finite-horizon) or discounted properties, nor assume any particular structural properties of the MDP. - [1357] arXiv:2403.09190 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Intention-aware Denoising Diffusion Model for Trajectory PredictionComments: 14 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Trajectory prediction is an essential component in autonomous driving, particularly for collision avoidance systems. Considering the inherent uncertainty of the task, numerous studies have utilized generative models to produce multiple plausible future trajectories for each agent. However, most of them suffer from restricted representation ability or unstable training issues. To overcome these limitations, we propose utilizing the diffusion model to generate the distribution of future trajectories. Two cruxes are to be settled to realize such an idea. First, the diversity of intention is intertwined with the uncertain surroundings, making the true distribution hard to parameterize. Second, the diffusion process is time-consuming during the inference phase, rendering it unrealistic to implement in a real-time driving system. We propose an Intention-aware denoising Diffusion Model (IDM), which tackles the above two problems. We decouple the original uncertainty into intention uncertainty and action uncertainty and model them with two dependent diffusion processes. To decrease the inference time, we reduce the variable dimensions in the intention-aware diffusion process and restrict the initial distribution of the action-aware diffusion process, which leads to fewer diffusion steps. To validate our approach, we conduct experiments on the Stanford Drone Dataset (SDD) and ETH/UCY dataset. Our methods achieve state-of-the-art results, with an FDE of 13.83 pixels on the SDD dataset and 0.36 meters on the ETH/UCY dataset. Compared with the original diffusion model, IDM reduces inference time by two-thirds. Interestingly, our experiments further reveal that introducing intention information is beneficial in modeling the diffusion process of fewer steps.
- [1358] arXiv:2403.09193 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Are Vision Language Models Texture or Shape Biased and Can We Steer Them?Paul Gavrikov , Jovita Lukasik , Steffen Jung , Robert Geirhos , Bianca Lamm , Muhammad Jehanzeb Mirza , Margret Keuper , Janis KeuperSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Abstract: Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over to image captioning, and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
- [1359] arXiv:2403.09199 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Customizing Segmentation Foundation Model via Prompt Learning for Instance SegmentationComments: 11 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recently, foundation models trained on massive datasets to adapt to a wide range of domains have attracted considerable attention and are actively being explored within the computer vision community. Among these, the Segment Anything Model (SAM) stands out for its remarkable progress in generalizability and flexibility for image segmentation tasks, achieved through prompt-based object mask generation. However, despite its strength, SAM faces two key limitations when applied to customized instance segmentation that segments specific objects or those in unique environments not typically present in the training data: 1) the ambiguity inherent in input prompts and 2) the necessity for extensive additional training to achieve optimal segmentation. To address these challenges, we propose a novel method, customized instance segmentation via prompt learning tailored to SAM. Our method involves a prompt learning module (PLM), which adjusts input prompts into the embedding space to better align with user intentions, thereby enabling more efficient training. Furthermore, we introduce a point matching module (PMM) to enhance the feature representation for finer segmentation by ensuring detailed alignment with ground truth boundaries. Experimental results on various customized instance segmentation scenarios demonstrate the effectiveness of the proposed method.
- [1360] arXiv:2403.09206 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Upper Bound of Bayesian Generalization Error in Partial Concept Bottleneck Model (CBM): Partial CBM outperforms naive CBMComments: 17 pages, 1 figure, submitted to TMLRSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Abstract: Concept Bottleneck Model (CBM) is a methods for explaining neural networks. In CBM, concepts which correspond to reasons of outputs are inserted in the last intermediate layer as observed values. It is expected that we can interpret the relationship between the output and concept similar to linear regression. However, this interpretation requires observing all concepts and decreases the generalization performance of neural networks. Partial CBM (PCBM), which uses partially observed concepts, has been devised to resolve these difficulties. Although some numerical experiments suggest that the generalization performance of PCBMs is almost as high as that of the original neural networks, the theoretical behavior of its generalization error has not been yet clarified since PCBM is singular statistical model. In this paper, we reveal the Bayesian generalization error in PCBM with a three-layered and linear architecture. The result indcates that the structure of partially observed concepts decreases the Bayesian generalization error compared with that of CBM (full-observed concepts).
- [1361] arXiv:2403.09209 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: LAN: Learning Adaptive Neighbors for Real-Time Insider Threat DetectionComments: 13 pagesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Enterprises and organizations are faced with potential threats from insider employees that may lead to serious consequences. Previous studies on insider threat detection (ITD) mainly focus on detecting abnormal users or abnormal time periods (e.g., a week or a day). However, a user may have hundreds of thousands of activities in the log, and even within a day there may exist thousands of activities for a user, requiring a high investigation budget to verify abnormal users or activities given the detection results. On the other hand, existing works are mainly post-hoc methods rather than real-time detection, which can not report insider threats in time before they cause loss. In this paper, we conduct the first study towards real-time ITD at activity level, and present a fine-grained and efficient framework LAN. Specifically, LAN simultaneously learns the temporal dependencies within an activity sequence and the relationships between activities across sequences with graph structure learning. Moreover, to mitigate the data imbalance problem in ITD, we propose a novel hybrid prediction loss, which integrates self-supervision signals from normal activities and supervision signals from abnormal activities into a unified loss for anomaly detection. We evaluate the performance of LAN on two widely used datasets, i.e., CERT r4.2 and CERT r5.2. Extensive and comparative experiments demonstrate the superiority of LAN, outperforming 9 state-of-the-art baselines by at least 9.92% and 6.35% in AUC for real-time ITD on CERT r4.2 and r5.2, respectively. Moreover, LAN can be also applied to post-hoc ITD, surpassing 8 competitive baselines by at least 7.70% and 4.03% in AUC on two datasets. Finally, the ablation study, parameter analysis, and compatibility analysis evaluate the impact of each module and hyper-parameter in LAN. The source code can be obtained from this https URL .
- [1362] arXiv:2403.09215 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Laplace Approximation as Model Selection Criterion for Gaussian ProcessesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Model selection aims to find the best model in terms of accuracy, interpretability or simplicity, preferably all at once. In this work, we focus on evaluating model performance of Gaussian process models, i.e. finding a metric that provides the best trade-off between all those criteria. While previous work considers metrics like the likelihood, AIC or dynamic nested sampling, they either lack performance or have significant runtime issues, which severely limits applicability. We address these challenges by introducing multiple metrics based on the Laplace approximation, where we overcome a severe inconsistency occuring during naive application of the Laplace approximation. Experiments show that our metrics are comparable in quality to the gold standard dynamic nested sampling without compromising for computational speed. Our model selection criteria allow significantly faster and high quality model selection of Gaussian process models.
- [1363] arXiv:2403.09227 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic SimulationChengshu Li , Ruohan Zhang , Josiah Wong , Cem Gokmen , Sanjana Srivastava , Roberto Martín-Martín , Chen Wang , Gabrael Levine , Wensi Ai , Benjamin Martinez , Hang Yin , Michael Lingelbach , Minjune Hwang , Ayano Hiranaka , Sujay Garlanka , Arman Aydin , Sharon Lee , Jiankai Sun , Mona Anvari , Manasi Sharma , Dhruva Bansal , Samuel Hunter , Kyu-Young Kim , Alan Lou , Caleb R Matthews , Ivan Villa-Renteria , Jerry Huayang Tang , Claire Tang , Fei Xia , Yunzhu Li , Silvio Savarese , Hyowon Gweon , C. Karen Liu , Jiajun Wu , Li Fei-FeiComments: A preliminary version was published at 6th Conference on Robot Learning (CoRL 2022)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: this https URL .
- [1364] arXiv:2403.09288 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question AnsweringComments: 6 pages, 3 figures, accepted by 2024 IEEE International Conference on Multimedia and ExpoSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of OCR modality to enhance fault-tolerant representation of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, We add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Various experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
- [1365] arXiv:2403.09290 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SELECTOR: Heterogeneous graph network with convolutional masked autoencoder for multimodal robust prediction of cancer survivalLiangrui Pan , Yijun Peng , Yan Li , Xiang Wang , Wenjuan Liu , Liwen Xu , Qingchun Liang , Shaoliang PengComments: Accepted on Computers in Biology and MedicineSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Accurately predicting the survival rate of cancer patients is crucial for aiding clinicians in planning appropriate treatment, reducing cancer-related medical expenses, and significantly enhancing patients' quality of life. Multimodal prediction of cancer patient survival offers a more comprehensive and precise approach. However, existing methods still grapple with challenges related to missing multimodal data and information interaction within modalities. This paper introduces SELECTOR, a heterogeneous graph-aware network based on convolutional mask encoders for robust multimodal prediction of cancer patient survival. SELECTOR comprises feature edge reconstruction, convolutional mask encoder, feature cross-fusion, and multimodal survival prediction modules. Initially, we construct a multimodal heterogeneous graph and employ the meta-path method for feature edge reconstruction, ensuring comprehensive incorporation of feature information from graph edges and effective embedding of nodes. To mitigate the impact of missing features within the modality on prediction accuracy, we devised a convolutional masked autoencoder (CMAE) to process the heterogeneous graph post-feature reconstruction. Subsequently, the feature cross-fusion module facilitates communication between modalities, ensuring that output features encompass all features of the modality and relevant information from other modalities. Extensive experiments and analysis on six cancer datasets from TCGA demonstrate that our method significantly outperforms state-of-the-art methods in both modality-missing and intra-modality information-confirmed cases. Our codes are made available at this https URL .
- [1366] arXiv:2403.09313 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper we present YOLOX-ViT, a novel object detection model, and investigate the efficacy of knowledge distillation for model size reduction without sacrificing performance. Focused on underwater robotics, our research addresses key questions about the viability of smaller models and the impact of the visual transformer layer in YOLOX. Furthermore, we introduce a new side-scan sonar image dataset, and use it to evaluate our object detector's performance. Results show that knowledge distillation effectively reduces false positives in wall detection. Additionally, the introduced visual transformer layer significantly improves object detection accuracy in the underwater environment. The source code of the knowledge distillation in the YOLOX-ViT is at this https URL .
- [1367] arXiv:2403.09317 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SD-Net: Symmetric-Aware Keypoint Prediction and Domain Adaptation for 6D Pose Estimation In Bin-picking ScenariosSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Despite the success in 6D pose estimation in bin-picking scenarios, existing methods still struggle to produce accurate prediction results for symmetry objects and real world scenarios. The primary bottlenecks include 1) the ambiguity keypoints caused by object symmetries; 2) the domain gap between real and synthetic data. To circumvent these problem, we propose a new 6D pose estimation network with symmetric-aware keypoint prediction and self-training domain adaptation (SD-Net). SD-Net builds on pointwise keypoint regression and deep hough voting to perform reliable detection keypoint under clutter and occlusion. Specifically, at the keypoint prediction stage, we designe a robust 3D keypoints selection strategy considering the symmetry class of objects and equivalent keypoints, which facilitate locating 3D keypoints even in highly occluded scenes. Additionally, we build an effective filtering algorithm on predicted keypoint to dynamically eliminate multiple ambiguity and outlier keypoint candidates. At the domain adaptation stage, we propose the self-training framework using a student-teacher training scheme. To carefully distinguish reliable predictions, we harnesses a tailored heuristics for 3D geometry pseudo labelling based on semi-chamfer distance. On public Sil'eane dataset, SD-Net achieves state-of-the-art results, obtaining an average precision of 96%. Testing learning and generalization abilities on public Parametric datasets, SD-Net is 8% higher than the state-of-the-art method. The code is available at this https URL .
- [1368] arXiv:2403.09326 (cross-list from cs.GR) [ pdf , ps , html , other ]
-
Title: HeadEvolver: Text to Head Avatars via Locally Learnable Mesh DeformationDuotun Wang , Hengyu Meng , Zeyu Cai , Zhijing Shao , Qianxi Liu , Lin Wang , Mingming Fan , Ying Shan , Xiaohang Zhan , Zeyu WangComments: 12 pages, 15 figuresSubjects: Graphics (cs.GR) ; Artificial Intelligence (cs.AI)
Abstract: We present HeadEvolver, a novel framework to generate stylized head avatars from text guidance. HeadEvolver uses locally learnable mesh deformation from a template head mesh, producing high-quality digital assets for detail-preserving editing and animation. To tackle the challenges of lacking fine-grained and semantic-aware local shape control in global deformation through Jacobians, we introduce a trainable parameter as a weighting factor for the Jacobian at each triangle to adaptively change local shapes while maintaining global correspondences and facial features. Moreover, to ensure the coherence of the resulting shape and appearance from different viewpoints, we use pretrained image diffusion models for differentiable rendering with regularization terms to refine the deformation under text guidance. Extensive experiments demonstrate that our method can generate diverse head avatars with an articulated mesh that can be edited seamlessly in 3D graphics software, facilitating downstream applications such as more efficient animation with inherited blend shapes and semantic consistency.
- [1369] arXiv:2403.09333 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-ReferringComments: Tech report working in progress. Codes, models and datasets will be released at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpass the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, Counting and \etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scaling up image resolution, we design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. This design inherently preserves the complete contexts and fine details, and significantly improves multimodal perception ability especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts and even coordinates. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, codes and models will be released at this https URL .
- [1370] arXiv:2403.09338 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LocalMamba: Visual State Space Model with Windowed Selective ScanSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: this https URL .
- [1371] arXiv:2403.09344 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SketchINR: A First Look into Sketches as Implicit Neural RepresentationsHmrishav Bandyopadhyay , Ayan Kumar Bhunia , Pinaki Nath Chowdhury , Aneeshan Sain , Tao Xiang , Timothy Hospedales , Yi-Zhe SongComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose SketchINR, to advance the representation of vector sketches with implicit neural models. A variable length vector sketch is compressed into a latent space of fixed dimension that implicitly encodes the underlying shape as a function of time and strokes. The learned function predicts the $xy$ point coordinates in a sketch at each time and stroke. Despite its simplicity, SketchINR outperforms existing representations at multiple tasks: (i) Encoding an entire sketch dataset into a fixed size latent vector, SketchINR gives $60\times$ and $10\times$ data compression over raster and vector sketches, respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity representation than other learned vector sketch representations, and is uniquely able to scale to complex vector sketches such as FS-COCO. (iii) SketchINR supports parallelisation that can decode/render $\sim$$100\times$ faster than other learned vector representations such as SketchRNN. (iv) SketchINR, for the first time, emulates the human ability to reproduce a sketch with varying abstraction in terms of number and complexity of strokes. As a first look at implicit sketches, SketchINR's compact high-fidelity representation will support future work in modelling long and complex sketches.
- [1372] arXiv:2403.09346 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-InstructionsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in well responding to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others). We generate 260K AVIs encompassing five categories of multimodal capabilities (nine tasks) and content bias. We then conduct a comprehensive evaluation involving 14 open-source LVLMs to assess their performance. AVIBench also serves as a convenient tool for practitioners to evaluate the robustness of LVLMs against AVIs. Our findings and extensive experimental results shed light on the vulnerabilities of LVLMs, and highlight that inherent biases exist even in advanced closed-source LVLMs like GeminiProVision and GPT-4V. This underscores the importance of enhancing the robustness, security, and fairness of LVLMs. The source code and benchmark will be made publicly available.
- [1373] arXiv:2403.09359 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object DetectionComments: Accepted by CVPR 2024. Link: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Domain adaptation for object detection typically entails transferring knowledge from one visible domain to another visible domain. However, there are limited studies on adapting from the visible to the thermal domain, because the domain gap between the visible and thermal domains is much larger than expected, and traditional domain adaptation can not successfully facilitate learning in this situation. To overcome this challenge, we propose a Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training paradigms for each domain. Specifically, we segregate the source and target training sets for building dual-teachers and successively deploy exponential moving average to the student model to individual teachers of each domain. The framework further incorporates a zigzag learning method between dual teachers, facilitating a gradual transition from the visible to thermal domains during training. We validate the superiority of our method through newly designed experimental protocols with well-known thermal datasets, i.e., FLIR and KAIST. Source code is available at this https URL .
- [1374] arXiv:2403.09407 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: LM2D: Lyrics- and Music-Driven Dance SynthesisSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract: Dance typically involves professional choreography with complex movements that follow a musical rhythm and can also be influenced by lyrical content. The integration of lyrics in addition to the auditory dimension, enriches the foundational tone and makes motion generation more amenable to its semantic meanings. However, existing dance synthesis methods tend to model motions only conditioned on audio signals. In this work, we make two contributions to bridge this gap. First, we propose LM2D, a novel probabilistic architecture that incorporates a multimodal diffusion model with consistency distillation, designed to create dance conditioned on both music and lyrics in one diffusion generation step. Second, we introduce the first 3D dance-motion dataset that encompasses both music and lyrics, obtained with pose estimation technologies. We evaluate our model against music-only baseline models with objective metrics and human evaluations, including dancers and choreographers. The results demonstrate LM2D is able to produce realistic and diverse dance matching both lyrics and music. A video summary can be accessed at: this https URL .
- [1375] arXiv:2403.09409 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: "Like a Nesting Doll": Analyzing Recursion Analogies Generated by CS Students using Large Language ModelsSeth Bernstein , Paul Denny , Juho Leinonen , Lauren Kan , Arto Hellas , Matt Littlefield , Sami Sarsa , Stephen MacNeilComments: 7 pages, 2 figures, ITiCSE 2024 preprintSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Grasping complex computing concepts often poses a challenge for students who struggle to anchor these new ideas to familiar experiences and understandings. To help with this, a good analogy can bridge the gap between unfamiliar concepts and familiar ones, providing an engaging way to aid understanding. However, creating effective educational analogies is difficult even for experienced instructors. We investigate to what extent large language models (LLMs), specifically ChatGPT, can provide access to personally relevant analogies on demand. Focusing on recursion, a challenging threshold concept, we conducted an investigation analyzing the analogies generated by more than 350 first-year computing students. They were provided with a code snippet and tasked to generate their own recursion-based analogies using ChatGPT, optionally including personally relevant topics in their prompts. We observed a great deal of diversity in the analogies produced with student-prescribed topics, in contrast to the otherwise generic analogies, highlighting the value of student creativity when working with LLMs. Not only did students enjoy the activity and report an improved understanding of recursion, but they described more easily remembering analogies that were personally and culturally relevant.
- [1376] arXiv:2403.09410 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context OptimizationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Utilizing potent representations of the large vision-language models (VLMs) to accomplish various downstream tasks has attracted increasing attention. Within this research field, soft prompt learning has become a representative approach for efficiently adapting VLMs such as CLIP, to tasks like image classification. However, most existing prompt learning methods learn text tokens that are unexplainable, which cannot satisfy the stringent interpretability requirements of Explainable Artificial Intelligence (XAI) in high-stakes scenarios like healthcare. To address this issue, we propose a novel explainable prompt learning framework that leverages medical knowledge by aligning the semantics of images, learnable prompts, and clinical concept-driven prompts at multiple granularities. Moreover, our framework addresses the lack of valuable concept annotations by eliciting knowledge from large language models and offers both visual and textual explanations for the prompts. Extensive experiments and explainability analyses conducted on various datasets, with and without concept labels, demonstrate that our method simultaneously achieves superior diagnostic performance, flexibility, and interpretability, shedding light on the effectiveness of foundation models in facilitating XAI. The code will be made publically available.
- [1377] arXiv:2403.09412 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OpenGraph: Open-Vocabulary Hierarchical 3D Graph Representation in Large-Scale Outdoor EnvironmentsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Environment representations endowed with sophisticated semantics are pivotal for facilitating seamless interaction between robots and humans, enabling them to effectively carry out various tasks. Open-vocabulary maps, powered by Visual-Language models (VLMs), possess inherent advantages, including zero-shot learning and support for open-set classes. However, existing open-vocabulary maps are primarily designed for small-scale environments, such as desktops or rooms, and are typically geared towards limited-area tasks involving robotic indoor navigation or in-place manipulation. They face challenges in direct generalization to outdoor environments characterized by numerous objects and complex tasks, owing to limitations in both understanding level and map structure. In this work, we propose OpenGraph, the first open-vocabulary hierarchical graph representation designed for large-scale outdoor environments. OpenGraph initially extracts instances and their captions from visual images, enhancing textual reasoning by encoding them. Subsequently, it achieves 3D incremental object-centric mapping with feature embedding by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results from public dataset SemanticKITTI demonstrate that OpenGraph achieves the highest segmentation and query accuracy. The source code of OpenGraph is publicly available at this https URL .
- [1378] arXiv:2403.09422 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Mitigating attribute amplification in counterfactual image generationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Causal generative modelling is gaining interest in medical imaging due to its ability to answer interventional and counterfactual queries. Most work focuses on generating counterfactual images that look plausible, using auxiliary classifiers to enforce effectiveness of simulated interventions. We investigate pitfalls in this approach, discovering the issue of attribute amplification, where unrelated attributes are spuriously affected during interventions, leading to biases across protected characteristics and disease status. We show that attribute amplification is caused by the use of hard labels in the counterfactual training process and propose soft counterfactual fine-tuning to mitigate this issue. Our method substantially reduces the amplification effect while maintaining effectiveness of generated images, demonstrated on a large chest X-ray dataset. Our work makes an important advancement towards more faithful and unbiased causal modelling in medical imaging.
- [1379] arXiv:2403.09439 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: 3D-SceneDreamer: Text-Driven 3D-Consistent Scene GenerationComments: 11 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevent the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
- [1380] arXiv:2403.09442 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: LLM-based agents for automating the enhancement of user story quality: An early reportComments: 16 pages, 5 figures, 2 tablesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: In agile software development, maintaining high-quality user stories is crucial, but also challenging. This study explores the use of large language models to automatically improve the user story quality in Austrian Post Group IT agile teams. We developed a reference model for an Autonomous LLM-based Agent System and implemented it at the company. The quality of user stories in the study and the effectiveness of these agents for user story quality improvement was assessed by 11 participants across six agile teams. Our findings demonstrate the potential of LLMs in improving user story quality, contributing to the research on AI role in agile development, and providing a practical example of the transformative impact of AI in an industry setting.
- [1381] arXiv:2403.09472 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Easy-to-Hard Generalization: Scalable Alignment Beyond Human SupervisionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans? This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term as \textit{easy-to-hard generalization}. Our key insight is that an evaluator (reward model) trained on supervisions for easier tasks can be effectively used for scoring candidate solutions of harder tasks and hence facilitating easy-to-hard generalization over different levels of tasks. Based on this insight, we propose a novel approach to scalable alignment, which firstly trains the process-supervised reward models on easy problems (e.g., level 1-3), and then uses them to evaluate the performance of policy models on hard problems. We show that such \textit{easy-to-hard generalization from evaluators} can enable \textit{easy-to-hard generalizations in generators} either through re-ranking or reinforcement learning (RL). Notably, our process-supervised 7b RL model achieves an accuracy of 34.0\% on MATH500, despite only using human supervision on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.
- [1382] arXiv:2403.09479 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Laying the Foundation First? Investigating the Generalization from Atomic Skills to Complex Reasoning TasksSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Current language models have demonstrated their capability to develop basic reasoning, but struggle in more complicated reasoning tasks that require a combination of atomic skills, such as math word problem requiring skills like arithmetic and unit conversion. Previous methods either do not improve the inherent atomic skills of models or not attempt to generalize the atomic skills to complex reasoning tasks. In this paper, we first propose a probing framework to investigate whether the atomic skill can spontaneously generalize to complex reasoning tasks. Then, we introduce a hierarchical curriculum learning training strategy to achieve better skill generalization. In our experiments, we find that atomic skills can not spontaneously generalize to compositional tasks. By leveraging hierarchical curriculum learning, we successfully induce generalization, significantly improve the performance of open-source LMs on complex reasoning tasks. Promisingly, the skill generalization exhibit effective in cross-dataset and cross-domain scenarios. Complex reasoning can also help enhance atomic skills. Our findings offer valuable guidance for designing better training strategies for complex reasoning tasks.
- [1383] arXiv:2403.09480 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: What Sketch Explainability Really Means for Downstream TasksHmrishav Bandyopadhyay , Pinaki Nath Chowdhury , Ayan Kumar Bhunia , Aneeshan Sain , Tao Xiang , Yi-Zhe SongComments: CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we explore the unique modality of sketch for explainability, emphasising the profound impact of human strokes compared to conventional pixel-oriented studies. Beyond explanations of network behavior, we discern the genuine implications of explainability across diverse downstream sketch-related tasks. We propose a lightweight and portable explainability solution -- a seamless plugin that integrates effortlessly with any pre-trained model, eliminating the need for re-training. Demonstrating its adaptability, we present four applications: highly studied retrieval and generation, and completely novel assisted drawing and sketch adversarial attacks. The centrepiece to our solution is a stroke-level attribution map that takes different forms when linked with downstream tasks. By addressing the inherent non-differentiability of rasterisation, we enable explanations at both coarse stroke level (SLA) and partial stroke level (P-SLA), each with its advantages for specific downstream tasks.
- [1384] arXiv:2403.09488 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Rectifying Demonstration Shortcut in In-Context LearningComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) are able to solve various tasks with only a few demonstrations utilizing their in-context learning (ICL) abilities. However, LLMs often rely on their pre-trained semantic priors of demonstrations rather than on the input-label relationships to proceed with ICL prediction. In this work, we term this phenomenon as the 'Demonstration Shortcut'. While previous works have primarily focused on improving ICL prediction results for predefined tasks, we aim to rectify the Demonstration Shortcut, thereby enabling the LLM to effectively learn new input-label relationships from demonstrations. To achieve this, we introduce In-Context Calibration, a demonstration-aware calibration method. We evaluate the effectiveness of the proposed method in two settings: (1) the Original ICL Task using the standard label space and (2) the Task Learning setting, where the label space is replaced with semantically unrelated tokens. In both settings, In-Context Calibration demonstrates substantial improvements, with results generalized across three LLM families (OPT, GPT, and Llama2) under various configurations.
- [1385] arXiv:2403.09498 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: From Skepticism to Acceptance: Simulating the Attitude Dynamics Toward Fake NewsSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: In the digital era, the rapid propagation of fake news and rumors via social networks brings notable societal challenges and impacts public opinion regulation. Traditional fake news modeling typically forecasts the general popularity trends of different groups or numerically represents opinions shift. However, these methods often oversimplify real-world complexities and overlook the rich semantic information of news text. The advent of large language models (LLMs) provides the possibility of modeling subtle dynamics of opinion. Consequently, in this work, we introduce a Fake news Propagation Simulation framework (FPS) based on LLM, which studies the trends and control of fake news propagation in detail. Specifically, each agent in the simulation represents an individual with a distinct personality. They are equipped with both short-term and long-term memory, as well as a reflective mechanism to mimic human-like thinking. Every day, they engage in random opinion exchanges, reflect on their thinking, and update their opinions. Our simulation results uncover patterns in fake news propagation related to topic relevance, and individual traits, aligning with real-world observations. Additionally, we evaluate various intervention strategies and demonstrate that early and appropriately frequent interventions strike a balance between governance cost and effectiveness, offering valuable insights for practical applications. Our study underscores the significant utility and potential of LLMs in combating fake news.
- [1386] arXiv:2403.09499 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Reinforcement Learning Approach to Dairy Farm Battery Management using Q LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Dairy farming consumes a significant amount of energy, making it an energy-intensive sector within agriculture. Integrating renewable energy generation into dairy farming could help address this challenge. Effective battery management is important for integrating renewable energy generation. Managing battery charging and discharging poses significant challenges because of fluctuations in electrical consumption, the intermittent nature of renewable energy generation, and fluctuations in energy prices. Artificial Intelligence (AI) has the potential to significantly improve the use of renewable energy in dairy farming, however, there is limited research conducted in this particular domain. This research considers Ireland as a case study as it works towards attaining its 2030 energy strategy centered on the utilization of renewable sources. This study proposes a Q-learning-based algorithm for scheduling battery charging and discharging in a dairy farm setting. This research also explores the effect of the proposed algorithm by adding wind generation data and considering additional case studies. The proposed algorithm reduces the cost of imported electricity from the grid by 13.41\%, peak demand by 2\%, and 24.49\% when utilizing wind generation. These results underline how reinforcement learning is highly effective in managing batteries in the dairy farming sector.
- [1387] arXiv:2403.09502 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: EquiAV: Leveraging Equivariance for Audio-Visual Contrastive LearningComments: 14 pages, 3 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.
- [1388] arXiv:2403.09506 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Don't Judge by the Look: Towards Motion Coherent Video RepresentationComments: Accepted by ICLR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at this https URL .
- [1389] arXiv:2403.09513 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield PromptingComments: Multimodal Large Language Models Defense, 25 PagesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), the imperative to ensure their safety has become increasingly pronounced. However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structured-based jailbreak attacks, where semantic content (e.g., "harmful text") has been injected into the images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose \textbf{Ada}ptive \textbf{Shield} Prompting (\textbf{AdaShield}), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks without fine-tuning MLLMs or training additional modules (e.g., post-stage content detector). Initially, we present a manually designed static defense prompt, which thoroughly examines the image and instruction content step by step and specifies response methods to malicious queries. Furthermore, we introduce an adaptive auto-refinement framework, consisting of a target MLLM and a LLM-based defense prompt generator (Defender). These components collaboratively and iteratively communicate to generate a defense prompt. Extensive experiments on the popular structure-based jailbreak attacks and benign datasets show that our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks without compromising the model's general capabilities evaluated on standard benign tasks. Our code is available at this https URL .
- [1390] arXiv:2403.09530 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision UnderstandingChris Kelly , Luhui Hu , Jiayin Hu , Yu Tian , Deshun Yang , Bang Yang , Cindy Yang , Zihao Li , Zaoshan Huang , Yuexian ZouComments: 12 pages, 7 figures, pending conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
Abstract: The evolution of text to visual components facilitates people's daily lives, such as generating image, videos from text and identifying the desired elements within the images. Computer vision models involving the multimodal abilities in the previous days are focused on image detection, classification based on well-defined objects. Large language models (LLMs) introduces the transformation from nature language to visual objects, which present the visual layout for text contexts. OpenAI GPT-4 has emerged as the pinnacle in LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms to convert 2D images to their 3D representations. However, the mismatching between the algorithms with the problem could lead to undesired results. In response to this challenge, we propose an unified VisionGPT-3D framework to consolidate the state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models and brings the automation in the selection of SOTA vision models, identifies the suitable 3D mesh creation algorithms corresponding to 2D depth maps analysis, generates optimal results based on diverse multimodal inputs such as text prompts.
Keywords: VisionGPT-3D, 3D vision understanding, Multimodal agent - [1391] arXiv:2403.09539 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Logits of API-Protected LLMs Leak Proprietary InformationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: The commercialization of large language models (LLMs) has led to the common practice of high-level API-only access to proprietary models. In this work, we show that even with a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1,000 for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We show that this lends itself to a model image or a model signature which unlocks several capabilities with affordable cost: efficiently discovering the LLM's hidden size, obtaining full-vocabulary outputs, detecting and disambiguating different model updates, identifying the source LLM given a single full LLM output, and even estimating the output layer parameters. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
- [1392] arXiv:2403.09549 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force FieldsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Abstract: Understanding the interactions of atoms such as forces in 3D atomistic systems is fundamental to many applications like molecular dynamics and catalyst design. However, simulating these interactions requires compute-intensive ab initio calculations and thus results in limited data for training neural networks. In this paper, we propose to use denoising non-equilibrium structures (DeNS) as an auxiliary task to better leverage training data and improve performance. For training with DeNS, we first corrupt a 3D structure by adding noise to its 3D coordinates and then predict the noise. Different from previous works on denoising, which are limited to equilibrium structures, the proposed method generalizes denoising to a much larger set of non-equilibrium structures. The main difference is that a non-equilibrium structure does not correspond to local energy minima and has non-zero forces, and therefore it can have many possible atomic positions compared to an equilibrium structure. This makes denoising non-equilibrium structures an ill-posed problem since the target of denoising is not uniquely defined. Our key insight is to additionally encode the forces of the original non-equilibrium structure to specify which non-equilibrium structure we are denoising. Concretely, given a corrupted non-equilibrium structure and the forces of the original one, we predict the non-equilibrium structure satisfying the input forces instead of any arbitrary structures. Since DeNS requires encoding forces, DeNS favors equivariant networks, which can easily incorporate forces and other higher-order tensors in node embeddings. We study the effectiveness of training equivariant networks with DeNS on OC20, OC22 and MD17 datasets and demonstrate that DeNS can achieve new state-of-the-art results on OC20 and OC22 and significantly improve training efficiency on MD17.
- [1393] arXiv:2403.09565 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Welcome Your New AI Teammate: On Safety Analysis by Leashing Large Language ModelsComments: Accepted in CAIN 2024, 6 pages, 1 figureSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: DevOps is a necessity in many industries, including the development of Autonomous Vehicles. In those settings, there are iterative activities that reduce the speed of SafetyOps cycles. One of these activities is "Hazard Analysis & Risk Assessment" (HARA), which is an essential step to start the safety requirements specification. As a potential approach to increase the speed of this step in SafetyOps, we have delved into the capabilities of Large Language Models (LLMs).
Our objective is to systematically assess their potential for application in the field of safety engineering. To that end, we propose a framework to support a higher degree of automation of HARA with LLMs. Despite our endeavors to automate as much of the process as possible, expert review remains crucial to ensure the validity and correctness of the analysis results, with necessary modifications made accordingly. - [1394] arXiv:2403.09567 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Enhancing Trust in Autonomous Agents: An Architecture for Accountability and Explainability through Blockchain and Large Language ModelsLaura Fernández-Becerra , Miguel Ángel González-Santamarta , Ángel Manuel Guerrero-Higueras , Francisco Javier Rodríguez-Lera , Vicente Matellán OliveraComments: 23 pages, 13 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The deployment of autonomous agents in environments involving human interaction has increasingly raised security concerns. Consequently, understanding the circumstances behind an event becomes critical, requiring the development of capabilities to justify their behaviors to non-expert users. Such explanations are essential in enhancing trustworthiness and safety, acting as a preventive measure against failures, errors, and misunderstandings. Additionally, they contribute to improving communication, bridging the gap between the agent and the user, thereby improving the effectiveness of their interactions. This work presents an accountability and explainability architecture implemented for ROS-based mobile robots. The proposed solution consists of two main components. Firstly, a black box-like element to provide accountability, featuring anti-tampering properties achieved through blockchain technology. Secondly, a component in charge of generating natural language explanations by harnessing the capabilities of Large Language Models (LLMs) over the data contained within the previously mentioned black box. The study evaluates the performance of our solution in three different scenarios, each involving autonomous agent navigation functionalities. This evaluation includes a thorough examination of accountability and explainability metrics, demonstrating the effectiveness of our approach in using accountable data from robot actions to obtain coherent, accurate and understandable explanations, even when facing challenges inherent in the use of autonomous agents in real-world scenarios.
- [1395] arXiv:2403.09603 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Optimistic Verifiable Training by Controlling Hardware NondeterminismComments: 11 pages, 5 figures, preprintSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The increasing compute demands of AI systems has led to the emergence of services that train models on behalf of clients lacking necessary resources. However, ensuring correctness of training and guarding against potential training-time attacks, such as data poisoning, poses challenges. Existing works on verifiable training largely fall into two classes: proof-based systems, which struggle to scale due to requiring cryptographic techniques, and "optimistic" methods that consider a trusted third-party auditor who replicates the training process. A key challenge with the latter is that hardware nondeterminism between GPU types during training prevents an auditor from replicating the training process exactly, and such schemes are therefore non-robust. We propose a method that combines training in a higher precision than the target model, rounding after intermediate computation steps, and storing rounding decisions based on an adaptive thresholding procedure, to successfully control for nondeterminism. Across three different NVIDIA GPUs (A40, Titan XP, RTX 2080 Ti), we achieve exact training replication at FP32 precision for both full-training and fine-tuning of ResNet-50 (23M) and GPT-2 (117M) models. Our verifiable training scheme significantly decreases the storage and time costs compared to proof-based systems.
- [1396] arXiv:2403.09605 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Counterfactual contrastive learning: robust representations via causal image synthesisComments: Code available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Contrastive pretraining is well-known to improve downstream task performance and model generalisation, especially in limited label settings. However, it is sensitive to the choice of augmentation pipeline. Positive pairs should preserve semantic information while destroying domain-specific information. Standard augmentation pipelines emulate domain-specific changes with pre-defined photometric transformations, but what if we could simulate realistic domain changes instead? In this work, we show how to utilise recent progress in counterfactual image generation to this effect. We propose CF-SimCLR, a counterfactual contrastive learning approach which leverages approximate counterfactual inference for positive pair creation. Comprehensive evaluation across five datasets, on chest radiography and mammography, demonstrates that CF-SimCLR substantially improves robustness to acquisition shift with higher downstream performance on both in- and out-of-distribution data, particularly for domains which are under-represented during training.
- [1397] arXiv:2403.09606 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large Language Models and Causal Inference in Collaboration: A Comprehensive SurveyXiaoyu Liu , Paiheng Xu , Junda Wu , Jiaxin Yuan , Yifan Yang , Yuhang Zhou , Fuxiao Liu , Tianrui Guan , Haoliang Wang , Tong Yu , Julian McAuley , Wei Ai , Furong HuangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on evaluating and improving LLMs from a causal view in the following areas: understanding and improving the LLMs' reasoning capacity, addressing fairness and safety issues in LLMs, complementing LLMs with explanations, and handling multimodality. Meanwhile, LLMs' strong reasoning capacities can in turn contribute to the field of causal inference by aiding causal relationship discovery and causal effect estimations. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and equitable artificial intelligence systems.
- [1398] arXiv:2403.09621 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement LearningComments: 53 pages, 1 figure, 1 tableSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Distributionally robust offline reinforcement learning (RL), which seeks robust policy training against environment perturbation by modeling dynamics uncertainty, calls for function approximations when facing large state-action spaces. However, the consideration of dynamics uncertainty introduces essential nonlinearity and computational burden, posing unique challenges for analyzing and practically employing function approximation. Focusing on a basic setting where the nominal model and perturbed models are linearly parameterized, we propose minimax optimal and computationally efficient algorithms realizing function approximation and initiate the study on instance-dependent suboptimality analysis in the context of robust offline RL. Our results uncover that function approximation in robust offline RL is essentially distinct from and probably harder than that in standard offline RL. Our algorithms and theoretical results crucially depend on a variety of new techniques, involving a novel function approximation mechanism incorporating variance information, a new procedure of suboptimality and estimation uncertainty decomposition, a quantification of the robust value function shrinkage, and a meticulously designed family of hard instances, which might be of independent interest.
- [1399] arXiv:2403.09629 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Quiet-STaR: Language Models Can Teach Themselves to Think Before SpeakingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%$\rightarrow$10.9%) and CommonsenseQA (36.3%$\rightarrow$47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.
- [1400] arXiv:2403.09631 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: 3D-VLA: A 3D Vision-Language-Action Generative World ModelHaoyu Zhen , Xiaowen Qiu , Peihao Chen , Jincheng Yang , Xin Yan , Yilun Du , Yining Hong , Chuang GanComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Abstract: Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.
- [1401] arXiv:2403.09635 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Transformers Get Stable: An End-to-End Signal Propagation Theory for Language ModelsComments: Akhil Kedia, Mohd Abbas Zaidi, Sushil Khyalia equal contribution. Source code is available at this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with 100s of layers. We find that transformer models could be much deeper - our deep models with fewer parameters outperform shallow models in Language Modeling, Speech Translation, and Image Classification, across Encoder-only, Decoder-only and Encoder-Decoder variants, for both Pre-LN and Post-LN transformers, for multiple datasets and model sizes. These improvements also translate into improved performance on downstream Question Answering tasks and improved robustness for image classification.
- [1402] arXiv:2403.09646 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: On Unsupervised Image-to-image translation and GAN stabilitySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: The problem of image-to-image translation is one that is intruiging and challenging at the same time, for the impact potential it can have on a wide variety of other computer vision applications like colorization, inpainting, segmentation and others. Given the high-level of sophistication needed to extract patterns from one domain and successfully applying them to another, especially, in a completely unsupervised (unpaired) manner, this problem has gained much attention as of the last few years. It is one of the first problems where successful applications to deep generative models, and especially Generative Adversarial Networks achieved astounding results that are actually of realworld impact, rather than just a show of theoretical prowess; the such that has been dominating the GAN world. In this work, we study some of the failure cases of a seminal work in the field, CycleGAN [1] and hypothesize that they are GAN-stability related, and propose two general models to try to alleviate these problems. We also reach the same conclusion of the problem being ill-posed that has been also circulating in the literature lately.
- [1403] arXiv:2403.09668 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Trustworthy Automated Driving through Qualitative Scene Understanding and ExplanationsComments: Transport Research Arena (TRA) 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present the Qualitative Explainable Graph (QXG): a unified symbolic and qualitative representation for scene understanding in urban mobility. QXG enables the interpretation of an automated vehicle's environment using sensor data and machine learning models. It leverages spatio-temporal graphs and qualitative constraints to extract scene semantics from raw sensor inputs, such as LiDAR and camera data, offering an intelligible scene model. Crucially, QXG can be incrementally constructed in real-time, making it a versatile tool for in-vehicle explanations and real-time decision-making across various sensor types. Our research showcases the transformative potential of QXG, particularly in the context of automated driving, where it elucidates decision rationales by linking the graph with vehicle actions. These explanations serve diverse purposes, from informing passengers and alerting vulnerable road users (VRUs) to enabling post-analysis of prior behaviours.
- [1404] arXiv:2403.09669 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative ModelsComments: Our work is accepted to ICLR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Image generative models have made significant progress in generating realistic and diverse images, supported by comprehensive guidance from various evaluation metrics. However, current video generative models struggle to generate even short video clips, with limited tools that provide insights for improvements. Current video evaluation metrics are simple adaptations of image metrics by switching the embeddings with video embedding networks, which may underestimate the unique characteristics of video. Our analysis reveals that the widely used Frechet Video Distance (FVD) has a stronger emphasis on the spatial aspect than the temporal naturalness of video and is inherently constrained by the input size of the embedding networks used, limiting it to 16 frames. Additionally, it demonstrates considerable instability and diverges from human evaluations. To address the limitations, we propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects. This feature allows comprehensive analysis and evaluation of video generative models from various perspectives, unconstrained by video length. We provide analytical and experimental evidence demonstrating that STREAM provides an effective evaluation tool for both visual and temporal quality of videos, offering insights into area of improvement for video generative models. To the best of our knowledge, STREAM is the first evaluation metric that can separately assess the temporal and spatial aspects of videos. Our code is available at this https URL .
- [1405] arXiv:2403.09671 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: CoRaiS: Lightweight Real-Time Scheduler for Multi-Edge Cooperative ComputingComments: Under ReviewSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: Multi-edge cooperative computing that combines constrained resources of multiple edges into a powerful resource pool has the potential to deliver great benefits, such as a tremendous computing power, improved response time, more diversified services. However, the mass heterogeneous resources composition and lack of scheduling strategies make the modeling and cooperating of multi-edge computing system particularly complicated. This paper first proposes a system-level state evaluation model to shield the complex hardware configurations and redefine the different service capabilities at heterogeneous edges. Secondly, an integer linear programming model is designed to cater for optimally dispatching the distributed arriving requests. Finally, a learning-based lightweight real-time scheduler, CoRaiS, is proposed. CoRaiS embeds the real-time states of multi-edge system and requests information, and combines the embeddings with a policy network to schedule the requests, so that the response time of all requests can be minimized. Evaluation results verify that CoRaiS can make a high-quality scheduling decision in real time, and can be generalized to other multi-edge computing system, regardless of system scales. Characteristic validation also demonstrates that CoRaiS successfully learns to balance loads, perceive real-time state and recognize heterogeneity while scheduling.
- [1406] arXiv:2403.09673 (cross-list from q-bio.BM) [ pdf , ps , html , other ]
-
Title: FoldToken: Learning Protein Language via Vector Quantization and BeyondSubjects: Biomolecules (q-bio.BM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce \textbf{FoldTokenizer} to represent protein sequence-structure as discrete symbols. This innovative approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We refer to the learned discrete symbols as \textbf{FoldToken}, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language on general backbone inpainting and antibody design tasks, building the first GPT-style model (\textbf{FoldGPT}) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (\textbf{SoftCVQ}).
- [1407] arXiv:2403.09676 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language ModelsComments: AI deception, Large Language Models, ChatGPTSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This research critically navigates the intricate landscape of AI deception, concentrating on deceptive behaviours of Large Language Models (LLMs). My objective is to elucidate this issue, examine the discourse surrounding it, and subsequently delve into its categorization and ramifications. The essay initiates with an evaluation of the AI Safety Summit 2023 (ASS) and introduction of LLMs, emphasising multidimensional biases that underlie their deceptive behaviours.The literature review covers four types of deception categorised: Strategic deception, Imitation, Sycophancy, and Unfaithful Reasoning, along with the social implications and risks they entail. Lastly, I take an evaluative stance on various aspects related to navigating the persistent challenges of the deceptive AI. This encompasses considerations of international collaborative governance, the reconfigured engagement of individuals with AI, proposal of practical adjustments, and specific elements of digital education.
- [1408] arXiv:2403.09680 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Pre-Sorted Tsetlin Machine (The Genetic K-Medoid Method)Comments: 6 pages, 12 figures, 3 tablesSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This paper proposes a machine learning pre-sort stage to traditional supervised learning using Tsetlin Machines. Initially, K data-points are identified from the dataset using an expedited genetic algorithm to solve the maximum dispersion problem. These are then used as the initial placement to run the K-Medoid clustering algorithm. Finally, an expedited genetic algorithm is used to align K independent Tsetlin Machines by maximising hamming distance. For MNIST level classification problems, results demonstrate up to 10% improvement in accuracy, approx. 383X reduction in training time and approx. 86X reduction in inference time.
- [1409] arXiv:2403.09700 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Shapley Values-Powered Framework for Fair Reward Split in Content Produced by GenAIComments: 36 pages, 32 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: It is evident that, currently, generative models are surpassed in quality by human professionals. However, with the advancements in Artificial Intelligence, this gap will narrow, leading to scenarios where individuals who have dedicated years of their lives to mastering a skill become obsolete due to their high costs, which are inherently linked to the time they require to complete a task -- a task that AI could accomplish in minutes or seconds. To avoid future social upheavals, we must, even now, contemplate how to fairly assess the contributions of such individuals in training generative models and how to compensate them for the reduction or complete loss of their incomes. In this work, we propose a method to structure collaboration between model developers and data providers. To achieve this, we employ Shapley Values to quantify the contribution of artist(s) in an image generated by the Stable Diffusion-v1.5 model and to equitably allocate the reward among them.
- [1410] arXiv:2403.09703 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Concept-aware Data Construction Improves In-context Learning of Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings.
In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilise new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in a performance comparable to the previous in-context learners using magnitudes of more training data. - [1411] arXiv:2403.09704 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Alignment Studio: Aligning Large Language Models to Particular Contextual RegulationsSwapnaja Achintalwar , Ioana Baldini , Djallel Bouneffouf , Joan Byamugisha , Maria Chang , Pierre Dognin , Eitan Farchi , Ndivhuwo Makondo , Aleksandra Mojsilovic , Manish Nagireddy , Karthikeyan Natesan Ramamurthy , Inkit Padhi , Orna Raz , Jesus Rios , Prasanna Sattigeri , Moninder Singh , Siphiwe Thwala , Rosario A. Uceda-Sosa , Kush R. VarshneyComments: 7 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The alignment of large language models is usually done by model providers to add or control behaviors that are common or universally understood across use cases and contexts. In contrast, in this article, we present an approach and architecture that empowers application developers to tune a model to their particular values, social norms, laws and other regulations, and orchestrate between potentially conflicting requirements in context. We lay out three main components of such an Alignment Studio architecture: Framers, Instructors, and Auditors that work in concert to control the behavior of a language model. We illustrate this approach with a running example of aligning a company's internal-facing enterprise chatbot to its business conduct guidelines.
- [1412] arXiv:2403.09705 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Novel Nuanced Conversation Evaluation Framework for Large Language Models in Mental HealthSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Understanding the conversation abilities of Large Language Models (LLMs) can help lead to its more cautious and appropriate deployment. This is especially important for safety-critical domains like mental health, where someone's life may depend on the exact wording of a response to an urgent question. In this paper, we propose a novel framework for evaluating the nuanced conversation abilities of LLMs. Within it, we develop a series of quantitative metrics developed from literature on using psychotherapy conversation analysis literature. While we ensure that our framework and metrics are transferable by researchers to relevant adjacent domains, we apply them to the mental health field. We use our framework to evaluate several popular frontier LLMs, including some GPT and Llama models, through a verified mental health dataset. Our results show that GPT4 Turbo can perform significantly more similarly to verified therapists than other selected LLMs. We conduct additional analysis to examine how LLM conversation performance varies across specific mental health topics. Our results indicate that GPT4 Turbo performs well in achieving high correlation with verified therapists in particular topics such as Parenting and Relationships. We believe our contributions will help researchers develop better LLMs that, in turn, will more positively support people's lives.
- [1413] arXiv:2403.09706 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Schema-Aware Multi-Task Learning for Complex Text-to-SQLComments: 8pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: Conventional text-to-SQL parsers are not good at synthesizing complex SQL queries that involve multiple tables or columns, due to the challenges inherent in identifying the correct schema items and performing accurate alignment between question and schema items. To address the above issue, we present a schema-aware multi-task learning framework (named MTSQL) for complicated SQL queries. Specifically, we design a schema linking discriminator module to distinguish the valid question-schema linkings, which explicitly instructs the encoder by distinctive linking relations to enhance the alignment quality. On the decoder side, we define 6-type relationships to describe the connections between tables and columns (e.g., WHERE_TC), and introduce an operator-centric triple extractor to recognize those associated schema items with the predefined relationship. Also, we establish a rule set of grammar constraints via the predicted triples to filter the proper SQL operators and schema items during the SQL generation. On Spider, a cross-domain challenging text-to-SQL benchmark, experimental results indicate that MTSQL is more effective than baselines, especially in extremely hard scenarios. Moreover, further analyses verify that our approach leads to promising improvements for complicated SQL queries.
- [1414] arXiv:2403.09712 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Knowledge-Injected Curriculum Pretraining Framework for Question AnsweringComments: Accepted by WWW 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge-based question answering (KBQA) is a key task in NLP research, and also an approach to access the web data and knowledge, which requires exploiting knowledge graphs (KGs) for reasoning. In the literature, one promising solution for KBQA is to incorporate the pretrained language model (LM) with KGs by generating KG-centered pretraining corpus, which has shown its superiority. However, these methods often depend on specific techniques and resources to work, which may not always be available and restrict its application. Moreover, existing methods focus more on improving language understanding with KGs, while neglect the more important human-like complex reasoning. To this end, in this paper, we propose a general Knowledge-Injected Curriculum Pretraining framework (KICP) to achieve comprehensive KG learning and exploitation for KBQA tasks, which is composed of knowledge injection (KI), knowledge adaptation (KA) and curriculum reasoning (CR). Specifically, the KI module first injects knowledge into the LM by generating KG-centered pretraining corpus, and generalizes the process into three key steps that could work with different implementations for flexible application. Next, the KA module learns knowledge from the generated corpus with LM equipped with an adapter as well as keeps its original natural language understanding ability to reduce the negative impacts of the difference between the generated and natural corpus. Last, to enable the LM with complex reasoning, the CR module follows human reasoning patterns to construct three corpora with increasing difficulties of reasoning, and further trains the LM from easy to hard in a curriculum manner. We provide an implementation of the general framework, and evaluate the proposed KICP on four real-word datasets. The results demonstrate that our framework can achieve higher performances.
- [1415] arXiv:2403.09714 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Linguistic Structure Induction from Language ModelsComments: Master's Thesis. Supervised by Laura Kallmeyer and David ArpsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Linear sequences of words are implicitly represented in our brains by hierarchical structures that organize the composition of words in sentences. Linguists formalize different frameworks to model this hierarchy; two of the most common syntactic frameworks are Constituency and Dependency. Constituency represents sentences as nested groups of phrases, while dependency represents a sentence by assigning relations between its words. Recently, the pursuit of intelligent machines has produced Language Models (LMs) capable of solving many language tasks with a human-level performance. Many studies now question whether LMs implicitly represent syntactic hierarchies. This thesis focuses on producing constituency and dependency structures from LMs in an unsupervised setting. I review the critical methods in this field and highlight a line of work that utilizes a numerical representation for binary constituency trees (Syntactic Distance). I present a detailed study on StructFormer (SF) (Shen et al., 2021), which retrofits a transformer encoder architecture with a parser network to produce constituency and dependency structures. I present six experiments to analyze and address this field's challenges; experiments include investigating the effect of repositioning the parser network within the SF architecture, evaluating subword-based induced trees, and benchmarking the models developed in the thesis experiments on linguistic tasks. Models benchmarking is performed by participating in the BabyLM challenge, published at CoNLL 2023 (Momen et al., 2023). The results of this thesis encourage further development in the direction of retrofitting transformer-based models to induce syntactic structures, supported by the acceptable performance of SF in different experimental settings and the observed limitations that require innovative solutions to advance the state of syntactic structure induction.
- [1416] arXiv:2403.09717 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Enhancing Depression-Diagnosis-Oriented Chat with Psychological State TrackingSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Abstract: Depression-diagnosis-oriented chat aims to guide patients in self-expression to collect key symptoms for depression detection. Recent work focuses on combining task-oriented dialogue and chitchat to simulate the interview-based depression diagnosis. Whereas, these methods can not well capture the changing information, feelings, or symptoms of the patient during dialogues. Moreover, no explicit framework has been explored to guide the dialogue, which results in some useless communications that affect the experience. In this paper, we propose to integrate Psychological State Tracking (POST) within the large language model (LLM) to explicitly guide depression-diagnosis-oriented chat. Specifically, the state is adapted from a psychological theoretical model, which consists of four components, namely Stage, Information, Summary and Next. We fine-tune an LLM model to generate the dynamic psychological state, which is further used to assist response generation at each turn to simulate the psychiatrist. Experimental results on the existing benchmark show that our proposed method boosts the performance of all subtasks in depression-diagnosis-oriented chat.
- [1417] arXiv:2403.09718 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Comprehensive Implementation of TextCNN for Enhanced Collaboration between Natural Language Processing and System RecommendationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Natural Language Processing (NLP) is an important branch of artificial intelligence that studies how to enable computers to understand, process, and generate human language. Text classification is a fundamental task in NLP, which aims to classify text into different predefined categories. Text classification is the most basic and classic task in natural language processing, and most of the tasks in natural language processing can be regarded as classification tasks. In recent years, deep learning has achieved great success in many research fields, and today, it has also become a standard technology in the field of NLP, which is widely integrated into text classification tasks. Unlike numbers and images, text processing emphasizes fine-grained processing ability. Traditional text classification methods generally require preprocessing the input model's text data. Additionally, they also need to obtain good sample features through manual annotation and then use classical machine learning algorithms for classification. Therefore, this paper analyzes the application status of deep learning in the three core tasks of NLP (including text representation, word order modeling, and knowledge representation). This content explores the improvement and synergy achieved through natural language processing in the context of text classification, while also taking into account the challenges posed by adversarial techniques in text generation, text classification, and semantic parsing. An empirical study on text classification tasks demonstrates the effectiveness of interactive integration training, particularly in conjunction with TextCNN, highlighting the significance of these advancements in text classification augmentation and enhancement.
- [1418] arXiv:2403.09719 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Mevaker: Conclusion Extraction and Allocation Resources for the Hebrew LanguageSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this paper, we introduce summarization MevakerSumm and conclusion extraction MevakerConc datasets for the Hebrew language based on the State Comptroller and Ombudsman of Israel reports, along with two auxiliary datasets. We accompany these datasets with models for conclusion extraction (HeConE, HeConEspc) and conclusion allocation (HeCross). All of the code, datasets, and model checkpoints used in this work are publicly available.
- [1419] arXiv:2403.09720 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Fine-tuning vs Prompting, Can Language Models Understand Human Values?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Accurately handling the underlying support values in sentences is crucial for understanding the speaker's tendencies, yet it poses a challenging task in natural language understanding (NLU). In this article, we explore the potential of fine-tuning and prompt tuning in this downstream task, using the Human Value Detection 2023. Additionally, we attempt to validate whether models can effectively solve the problem based on the knowledge acquired during the pre-training stage. Simultaneously, our interest lies in the capabilities of large language models (LLMs) aligned with RLHF in this task, and some preliminary attempts are presented.
- [1420] arXiv:2403.09721 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Semantic Mention Graph Augmented Model for Document-Level Event Argument ExtractionComments: Accepted By Coling 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Document-level Event Argument Extraction (DEAE) aims to identify arguments and their specific roles from an unstructured document. The advanced approaches on DEAE utilize prompt-based methods to guide pre-trained language models (PLMs) in extracting arguments from input documents. They mainly concentrate on establishing relations between triggers and entity mentions within documents, leaving two unresolved problems: a) independent modeling of entity mentions; b) document-prompt isolation. To this end, we propose a semantic mention Graph Augmented Model (GAM) to address these two problems in this paper. Firstly, GAM constructs a semantic mention graph that captures relations within and between documents and prompts, encompassing co-existence, co-reference and co-type relations. Furthermore, we introduce an ensembled graph transformer module to address mentions and their three semantic relations effectively. Later, the graph-augmented encoder-decoder module incorporates the relation-specific graph into the input embedding of PLMs and optimizes the encoder section with topology information, enhancing the relations comprehensively. Extensive experiments on the RAMS and WikiEvents datasets demonstrate the effectiveness of our approach, surpassing baseline methods and achieving a new state-of-the-art performance.
- [1421] arXiv:2403.09722 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Enhancing Readmission Prediction with Deep Learning: Extracting Biomedical Concepts from Clinical TextsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Hospital readmission, defined as patients being re-hospitalized shortly after discharge, is a critical concern as it impacts patient outcomes and healthcare costs. Identifying patients at risk of readmission allows for timely interventions, reducing re-hospitalization rates and overall treatment costs. This study focuses on predicting patient readmission within less than 30 days using text mining techniques applied to discharge report texts from electronic health records (EHR). Various machine learning and deep learning methods were employed to develop a classification model for this purpose. A novel aspect of this research involves leveraging the Bio-Discharge Summary Bert (BDSS) model along with principal component analysis (PCA) feature extraction to preprocess data for deep learning model input. Our analysis of the MIMIC-III dataset indicates that our approach, which combines the BDSS model with a multilayer perceptron (MLP), outperforms state-of-the-art methods. This model achieved a recall of 94% and an area under the curve (AUC) of 75%, showcasing its effectiveness in predicting patient readmissions. This study contributes to the advancement of predictive modeling in healthcare by integrating text mining techniques with deep learning algorithms to improve patient outcomes and optimize resource allocation.
- [1422] arXiv:2403.09725 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RAD-PHI2: Instruction Tuning PHI-2 for RadiologySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Small Language Models (SLMs) have shown remarkable performance in general domain language understanding, reasoning and coding tasks, but their capabilities in the medical domain, particularly concerning radiology text, is less explored. In this study, we investigate the application of SLMs for general radiology knowledge specifically question answering related to understanding of symptoms, radiological appearances of findings, differential diagnosis, assessing prognosis, and suggesting treatments w.r.t diseases pertaining to different organ systems. Additionally, we explore the utility of SLMs in handling text-related tasks with respect to radiology reports within AI-driven radiology workflows. We fine-tune Phi-2, a SLM with 2.7 billion parameters using high-quality educational content from Radiopaedia, a collaborative online radiology resource. The resulting language model, RadPhi-2-Base, exhibits the ability to address general radiology queries across various systems (e.g., chest, cardiac). Furthermore, we investigate Phi-2 for instruction tuning, enabling it to perform specific tasks. By fine-tuning Phi-2 on both general domain tasks and radiology-specific tasks related to chest X-ray reports, we create Rad-Phi2. Our empirical results reveal that Rad-Phi2 Base and Rad-Phi2 perform comparably or even outperform larger models such as Mistral-7B-Instruct-v0.2 and GPT-4 providing concise and precise answers. In summary, our work demonstrates the feasibility and effectiveness of utilizing SLMs in radiology workflows both for knowledge related queries as well as for performing specific tasks related to radiology reports thereby opening up new avenues for enhancing the quality and efficiency of radiology practice.
- [1423] arXiv:2403.09727 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systemsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The development of generative large language models (G-LLM) opened up new opportunities for the development of new types of knowledge-based systems similar to ChatGPT, Bing, or Gemini. Fine-tuning (FN) and Retrieval-Augmented Generation (RAG) are the techniques that can be used to implement domain adaptation for the development of G-LLM-based knowledge systems. In our study, using ROUGE, BLEU, METEOR scores, and cosine similarity, we compare and examine the performance of RAG and FN for the GPT-J-6B, OPT-6.7B, LlaMA, LlaMA-2 language models. Based on measurements shown on different datasets, we demonstrate that RAG-based constructions are more efficient than models produced with FN. We point out that connecting RAG and FN is not trivial, because connecting FN models with RAG can cause a decrease in performance. Furthermore, we outline a simple RAG-based architecture which, on average, outperforms the FN models by 16% in terms of the ROGUE score, 15% in the case of the BLEU score, and 53% based on the cosine similarity. This shows the significant advantage of RAG over FN in terms of hallucination, which is not offset by the fact that the average 8% better METEOR score of FN models indicates greater creativity compared to RAG.
- [1424] arXiv:2403.09728 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Simulating Weighted Automata over Sequences and Trees with TransformersSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Abstract: Transformers are ubiquitous models in the natural language processing (NLP) community and have shown impressive empirical successes in the past few years. However, little is understood about how they reason and the limits of their computational capabilities. These models do not process data sequentially, and yet outperform sequential neural models such as RNNs. Recent work has shown that these models can compactly simulate the sequential reasoning abilities of deterministic finite automata (DFAs). This leads to the following question: can transformers simulate the reasoning of more complex finite state machines? In this work, we show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTA), a generalization of weighted automata to tree structured inputs. We prove these claims formally and provide upper bounds on the sizes of the transformer models needed as a function of the number of states the target automata. Empirically, we perform synthetic experiments showing that transformers are able to learn these compact solutions via standard gradient-based training.
- [1425] arXiv:2403.09732 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PET-SQL: A Prompt-enhanced Two-stage Text-to-SQL Framework with Cross-consistencyZhishuai Li , Xiang Wang , Jingjing Zhao , Sun Yang , Guoqing Du , Xiaoru Hu , Bin Zhang , Yuxiao Ye , Ziyue Li , Rui Zhao , Hangyu MaoSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in Text-to-SQL (Text2SQL) emphasize stimulating the large language models (LLM) on in-context learning, achieving significant results. Nevertheless, they face challenges when dealing with verbose database information and complex user intentions. This paper presents a two-stage framework to enhance the performance of current LLM-based natural language to SQL systems. We first introduce a novel prompt representation, called reference-enhanced representation, which includes schema information and randomly sampled cell values from tables to instruct LLMs in generating SQL queries. Then, in the first stage, question-SQL pairs are retrieved as few-shot demonstrations, prompting the LLM to generate a preliminary SQL (PreSQL). After that, the mentioned entities in PreSQL are parsed to conduct schema linking, which can significantly compact the useful information. In the second stage, with the linked schema, we simplify the prompt's schema information and instruct the LLM to produce the final SQL. Finally, as the post-refinement module, we propose using cross-consistency across different LLMs rather than self-consistency within a particular LLM. Our methods achieve new SOTA results on the Spider benchmark, with an execution accuracy of 87.6%.
- [1426] arXiv:2403.09733 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: OverleafCopilot: Empowering Academic Writing in Overleaf with Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The rapid development of Large Language Models (LLMs) has facilitated a variety of applications from different domains. In this technical report, we explore the integration of LLMs and the popular academic writing tool, Overleaf, to enhance the efficiency and quality of academic writing. To achieve the above goal, there are three challenges: i) including seamless interaction between Overleaf and LLMs, ii) establishing reliable communication with the LLM provider, and iii) ensuring user privacy. To address these challenges, we present OverleafCopilot, the first-ever tool (i.e., a browser extension) that seamlessly integrates LLMs and Overleaf, enabling researchers to leverage the power of LLMs while writing papers. Specifically, we first propose an effective framework to bridge LLMs and Overleaf. Then, we developed PromptGenius, a website for researchers to easily find and share high-quality up-to-date prompts. Thirdly, we propose an agent command system to help researchers quickly build their customizable agents. OverleafCopilot ( this https URL ) has been on the Chrome Extension Store, which now serves thousands of researchers. Additionally, the code of PromptGenius is released at this https URL . We believe our work has the potential to revolutionize academic writing practices, empowering researchers to produce higher-quality papers in less time.
- [1427] arXiv:2403.09734 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Do Large Language Models Solve ARC Visual Analogies Like People Do?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The Abstraction Reasoning Corpus (ARC) is a visual analogical reasoning test designed for humans and machines (Chollet, 2019). We compared human and large language model (LLM) performance on a new child-friendly set of ARC items. Results show that both children and adults outperform most LLMs on these tasks. Error analysis revealed a similar "fallback" solution strategy in LLMs and young children, where part of the analogy is simply copied. In addition, we found two other error types, one based on seemingly grasping key concepts (e.g., Inside-Outside) and the other based on simple combinations of analogy input matrices. On the whole, "concept" errors were more common in humans, and "matrix" errors were more common in LLMs. This study sheds new light on LLM reasoning ability and the extent to which we can use error analyses and comparisons with human development to understand how LLMs solve visual analogies.
- [1428] arXiv:2403.09735 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: A Sophisticated Framework for the Accurate Detection of Phishing WebsitesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Phishing is an increasingly sophisticated form of cyberattack that is inflicting huge financial damage to corporations throughout the globe while also jeopardizing individuals' privacy. Attackers are constantly devising new methods of launching such assaults and detecting them has become a daunting task. Many different techniques have been suggested, each with its own pros and cons. While machine learning-based techniques have been most successful in identifying such attacks, they continue to fall short in terms of performance and generalizability. This paper proposes a comprehensive methodology for detecting phishing websites. The goal is to design a system that is capable of accurately distinguishing phishing websites from legitimate ones and provides generalized performance over a broad variety of datasets. A combination of feature selection, greedy algorithm, cross-validation, and deep learning methods have been utilized to construct a sophisticated stacking ensemble classifier. Extensive experimentation on four different phishing datasets was conducted to evaluate the performance of the proposed technique. The proposed algorithm outperformed the other existing phishing detection models obtaining accuracy of 97.49%, 98.23%, 97.48%, and 98.20% on dataset-1 (UCI Phishing Websites Dataset), dataset-2 (Phishing Dataset for Machine Learning: Feature Evaluation), dataset-3 (Phishing Websites Dataset), and dataset-4 (Web page phishing detection), respectively. The high accuracy values obtained across all datasets imply the models' generalizability and effectiveness in the accurate identification of phishing websites.
- [1429] arXiv:2403.09738 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating Large Language Models as Generative User Simulators for Conversational RecommendationComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.
- [1430] arXiv:2403.09740 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Teaching Machines to Code: Smart Contract Translation with LLMsSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The advent of large language models (LLMs) has marked a significant milestone in the realm of artificial intelligence, with their capabilities often matching or surpassing human expertise in various domains. Among these achievements, their adeptness in translation tasks stands out, closely mimicking the intricate and preliminary processes undertaken by human translators to ensure the fidelity and quality of the translated content. Despite the advancements in utilizing LLMs for translating programming code across different languages, the domain of smart contract translation, particularly into languages not previously encountered by the LLM, remains largely unexplored. In our research, we present a pioneering approach, SolMover, which harnesses the synergy of two distinct LLMs within a unified framework. This framework is designed to grasp coding principles and apply this understanding to the translation of code into an unfamiliar language. Our study delves into the capacity of LLMs to mimic human learning processes, offering an in-depth evaluation of our methodology for converting smart contracts written in Solidity to Move, a language with limited resources. The framework employs one LLM to decipher coding conventions for the new language, creating a blueprint for the second LLM, which, lacking planning abilities, possesses coding expertise. The empirical evidence from our experiments suggests that SolMover substantially enhances performance compared to gpt-3.5-turbo-1106, and achieves superior results over competitors such as Palm2 and Mixtral-8x7B-Instruct. Additionally, our analysis highlights the efficacy of our bug mitigation strategy in elevating code quality across all models, even outside the SolMover framework.
- [1431] arXiv:2403.09743 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Human Factor in Detecting Errors of Large Language Models: A Systematic Literature Review and Future Research DirectionsComments: 21 papers analysed and synthesized in detail from a total search result size of 594 (raw results) / 61 (scanned) / 28 (selected)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The launch of ChatGPT by OpenAI in November 2022 marked a pivotal moment for Artificial Intelligence, introducing Large Language Models (LLMs) to the mainstream and setting new records in user adoption. LLMs, particularly ChatGPT, trained on extensive internet data, demonstrate remarkable conversational capabilities across various domains, suggesting a significant impact on the workforce. However, these models are susceptible to errors - "hallucinations" and omissions, generating incorrect or incomplete information. This poses risks especially in contexts where accuracy is crucial, such as legal compliance, medicine or fine-grained process frameworks.
There are both technical and human solutions to cope with this isse. This paper explores the human factors that enable users to detect errors in LLM outputs, a critical component in mitigating risks associated with their use in professional settings. Understanding these factors is essential for organizations aiming to leverage LLM technology efficiently, guiding targeted training and deployment strategies to enhance error detection by users. This approach not only aims to optimize the use of LLMs but also to prevent potential downstream issues stemming from reliance on inaccurate model responses. The research emphasizes the balance between technological advancement and human insight in maximizing the benefits of LLMs while minimizing the risks, particularly in areas where precision is paramount.
This paper performs a systematic literature research on this research topic, analyses and synthesizes the findings, and outlines future research directions. Literature selection cut-off date is January 11th 2024. - [1432] arXiv:2403.09744 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating the Application of Large Language Models to Generate Feedback in Programming EducationComments: accepted at IEEE Global Engineering Education Conference 2024, Kos, GreeceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Abstract: This study investigates the application of large language models, specifically GPT-4, to enhance programming education. The research outlines the design of a web application that uses GPT-4 to provide feedback on programming tasks, without giving away the solution. A web application for working on programming tasks was developed for the study and evaluated with 51 students over the course of one semester. The results show that most of the feedback generated by GPT-4 effectively addressed code errors. However, challenges with incorrect suggestions and hallucinated issues indicate the need for further improvements.
- [1433] arXiv:2403.09747 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Re-Search for The Truth: Multi-round Retrieval-augmented Large Language Models are Strong Fake News DetectorsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The proliferation of fake news has had far-reaching implications on politics, the economy, and society at large. While Fake news detection methods have been employed to mitigate this issue, they primarily depend on two essential elements: the quality and relevance of the evidence, and the effectiveness of the verdict prediction mechanism. Traditional methods, which often source information from static repositories like Wikipedia, are limited by outdated or incomplete data, particularly for emerging or rare claims. Large Language Models (LLMs), known for their remarkable reasoning and generative capabilities, introduce a new frontier for fake news detection. However, like traditional methods, LLM-based solutions also grapple with the limitations of stale and long-tail knowledge. Additionally, retrieval-enhanced LLMs frequently struggle with issues such as low-quality evidence retrieval and context length constraints. To address these challenges, we introduce a novel, retrieval-augmented LLMs framework--the first of its kind to automatically and strategically extract key evidence from web sources for claim verification. Employing a multi-round retrieval strategy, our framework ensures the acquisition of sufficient, relevant evidence, thereby enhancing performance. Comprehensive experiments across three real-world datasets validate the framework's superiority over existing methods. Importantly, our model not only delivers accurate verdicts but also offers human-readable explanations to improve result interpretability.
- [1434] arXiv:2403.09749 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Diverse Perspective Learning with Selection over Multiple Temporal PoolingsComments: 17 pages, 9 figuresJournal-ref: AAAI 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In Time Series Classification (TSC), temporal pooling methods that consider sequential information have been proposed. However, we found that each temporal pooling has a distinct mechanism, and can perform better or worse depending on time series data. We term this fixed pooling mechanism a single perspective of temporal poolings. In this paper, we propose a novel temporal pooling method with diverse perspective learning: Selection over Multiple Temporal Poolings (SoM-TP). SoM-TP dynamically selects the optimal temporal pooling among multiple methods for each data by attention. The dynamic pooling selection is motivated by the ensemble concept of Multiple Choice Learning (MCL), which selects the best among multiple outputs. The pooling selection by SoM-TP's attention enables a non-iterative pooling ensemble within a single classifier. Additionally, we define a perspective loss and Diverse Perspective Learning Network (DPLN). The loss works as a regularizer to reflect all the pooling perspectives from DPLN. Our perspective analysis using Layer-wise Relevance Propagation (LRP) reveals the limitation of a single perspective and ultimately demonstrates diverse perspective learning of SoM-TP. We also show that SoM-TP outperforms CNN models based on other temporal poolings and state-of-the-art models in TSC with extensive UCR/UEA repositories.
- [1435] arXiv:2403.09750 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Meta-Cognitive Analysis: Evaluating Declarative and Procedural Knowledge in Datasets and Large Language ModelsComments: Accepted by LREC-COLING 2024 as a short paperSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Declarative knowledge and procedural knowledge are two key parts in meta-cognitive theory, and these two hold significant importance in pre-training and inference of LLMs. However, a comprehensive analysis comparing these two types of knowledge is lacking, primarily due to challenges in definition, probing and quantitative assessment. In this paper, we explore from a new perspective by providing ground-truth knowledge for LLMs and evaluating the effective score. Through extensive experiments with widely-used datasets and models, we get conclusions: (1) In most tasks, benefits from declarative knowledge are greater than those from procedural knowledge. (2) Profits of procedural knowledge are larger than declarative knowledge only in reasoning tasks with simple logic. (3) As pre-training progresses and size increases, model ability to utilize both kinds of knowledge significantly improves, but in different speed. We do detailed analysis for the findings and this can provide primary guidance for evaluation and enhancement of large language models.
- [1436] arXiv:2403.09751 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: What Was Your Prompt? A Remote Keylogging Attack on AI AssistantsSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: AI assistants are becoming an integral part of society, used for asking advice or help in personal and confidential issues. In this paper, we unveil a novel side-channel that can be used to read encrypted responses from AI Assistants over the web: the token-length side-channel. We found that many vendors, including OpenAI and Microsoft, have this side-channel.
However, inferring the content of a response from a token-length sequence alone proves challenging. This is because tokens are akin to words, and responses can be several sentences long leading to millions of grammatically correct sentences. In this paper, we show how this can be overcome by (1) utilizing the power of a large language model (LLM) to translate these sequences, (2) providing the LLM with inter-sentence context to narrow the search space and (3) performing a known-plaintext attack by fine-tuning the model on the target model's writing style.
Using these methods, we were able to accurately reconstruct 29\% of an AI assistant's responses and successfully infer the topic from 55\% of them. To demonstrate the threat, we performed the attack on OpenAI's ChatGPT-4 and Microsoft's Copilot on both browser and API traffic. - [1437] arXiv:2403.09752 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things SystemsComments: 33 pages, 8 figures, journal paperSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The Internet of Medical Things (IoMT) transcends traditional medical boundaries, enabling a transition from reactive treatment to proactive prevention. This innovative method revolutionizes healthcare by facilitating early disease detection and tailored care, particularly in chronic disease management, where IoMT automates treatments based on real-time health data collection. Nonetheless, its benefits are countered by significant security challenges that endanger the lives of its users due to the sensitivity and value of the processed data, thereby attracting malicious interests. Moreover, the utilization of wireless communication for data transmission exposes medical data to interception and tampering by cybercriminals. Additionally, anomalies may arise due to human errors, network interference, or hardware malfunctions. In this context, anomaly detection based on Machine Learning (ML) is an interesting solution, but it comes up against obstacles in terms of explicability and protection of privacy. To address these challenges, a new framework for Intrusion Detection Systems (IDS) is introduced, leveraging Artificial Neural Networks (ANN) for intrusion detection while utilizing Federated Learning (FL) for privacy preservation. Additionally, eXplainable Artificial Intelligence (XAI) methods are incorporated to enhance model explanation and interpretation. The efficacy of the proposed framework is evaluated and compared with centralized approaches using multiple datasets containing network and medical data, simulating various attack types impacting the confidentiality, integrity, and availability of medical and physiological data. The results obtained offer compelling evidence that the FL method performs comparably to the centralized method, demonstrating high performance. Additionally, it affords the dual advantage of safeguarding privacy and providing model explanation.
- [1438] arXiv:2403.09753 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different LanguagesComments: Accepted as a full paper by the tinyML Research Symposium 2024Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Benchmarking plays a pivotal role in assessing and enhancing the performance of compact deep learning models designed for execution on resource-constrained devices, such as microcontrollers. Our study introduces a novel, entirely artificially generated benchmarking dataset tailored for speech recognition, representing a core challenge in the field of tiny deep learning. SpokeN-100 consists of spoken numbers from 0 to 99 spoken by 32 different speakers in four different languages, namely English, Mandarin, German and French, resulting in 12,800 audio samples. We determine auditory features and use UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) as a dimensionality reduction method to show the diversity and richness of the dataset. To highlight the use case of the dataset, we introduce two benchmark tasks: given an audio sample, classify (i) the used language and/or (ii) the spoken number. We optimized state-of-the-art deep neural networks and performed an evolutionary neural architecture search to find tiny architectures optimized for the 32-bit ARM Cortex-M4 nRF52840 microcontroller. Our results represent the first benchmark data achieved for SpokeN-100.
- [1439] arXiv:2403.09762 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Emotional Intelligence Through Artificial Intelligence : NLP and Deep Learning in the Analysis of Healthcare TextsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Abstract: This manuscript presents a methodical examination of the utilization of Artificial Intelligence in the assessment of emotions in texts related to healthcare, with a particular focus on the incorporation of Natural Language Processing and deep learning technologies. We scrutinize numerous research studies that employ AI to augment sentiment analysis, categorize emotions, and forecast patient outcomes based on textual information derived from clinical narratives, patient feedback on medications, and online health discussions. The review demonstrates noteworthy progress in the precision of algorithms used for sentiment classification, the prognostic capabilities of AI models for neurodegenerative diseases, and the creation of AI-powered systems that offer support in clinical decision-making. Remarkably, the utilization of AI applications has exhibited an enhancement in personalized therapy plans by integrating patient sentiment and contributing to the early identification of mental health disorders. There persist challenges, which encompass ensuring the ethical application of AI, safeguarding patient confidentiality, and addressing potential biases in algorithmic procedures. Nevertheless, the potential of AI to revolutionize healthcare practices is unmistakable, offering a future where healthcare is not only more knowledgeable and efficient but also more empathetic and centered around the needs of patients. This investigation underscores the transformative influence of AI on healthcare, delivering a comprehensive comprehension of its role in examining emotional content in healthcare texts and highlighting the trajectory towards a more compassionate approach to patient care. The findings advocate for a harmonious synergy between AI's analytical capabilities and the human aspects of healthcare.
- [1440] arXiv:2403.09793 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Socially Integrated Navigation: A Social Acting Robot with Deep Reinforcement LearningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract: Mobile robots are being used on a large scale in various crowded situations and become part of our society. The socially acceptable navigation behavior of a mobile robot with individual human consideration is an essential requirement for scalable applications and human acceptance. Deep Reinforcement Learning (DRL) approaches are recently used to learn a robot's navigation policy and to model the complex interactions between robots and humans. We propose to divide existing DRL-based navigation approaches based on the robot's exhibited social behavior and distinguish between social collision avoidance with a lack of social behavior and socially aware approaches with explicit predefined social behavior. In addition, we propose a novel socially integrated navigation approach where the robot's social behavior is adaptive and emerges from the interaction with humans. The formulation of our approach is derived from a sociological definition, which states that social acting is oriented toward the acting of others. The DRL policy is trained in an environment where other agents interact socially integrated and reward the robot's behavior individually. The simulation results indicate that the proposed socially integrated navigation approach outperforms a socially aware approach in terms of distance traveled, time to completion, and negative impact on all agents within the environment.
- [1441] arXiv:2403.09795 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming PreventionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Powerful generative Large Language Models (LLMs) are becoming popular tools amongst the general public as question-answering systems, and are being utilised by vulnerable groups such as children. With children increasingly interacting with these tools, it is imperative for researchers to scrutinise the safety of LLMs, especially for applications that could lead to serious outcomes, such as online child safety queries. In this paper, the efficacy of LLMs for online grooming prevention is explored both for identifying and avoiding grooming through advice generation, and the impact of prompt design on model performance is investigated by varying the provided context and prompt specificity. In results reflecting over 6,000 LLM interactions, we find that no models were clearly appropriate for online grooming prevention, with an observed lack of consistency in behaviours, and potential for harmful answer generation, especially from open-source models. We outline where and how models fall short, providing suggestions for improvement, and identify prompt designs that heavily altered model performance in troubling ways, with findings that can be used to inform best practice usage guides.
- [1442] arXiv:2403.09809 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Self-Supervised Learning for Time Series: Contrastive or Generative?Comments: Published at the AI4TS Workshop, IJCAI 2023Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Abstract: Self-supervised learning (SSL) has recently emerged as a powerful approach to learning representations from large-scale unlabeled data, showing promising results in time series analysis. The self-supervised representation learning can be categorized into two mainstream: contrastive and generative. In this paper, we will present a comprehensive comparative study between contrastive and generative methods in time series. We first introduce the basic frameworks for contrastive and generative SSL, respectively, and discuss how to obtain the supervision signal that guides the model optimization. We then implement classical algorithms (SimCLR vs. MAE) for each type and conduct a comparative analysis in fair settings. Our results provide insights into the strengths and weaknesses of each approach and offer practical recommendations for choosing suitable SSL methods. We also discuss the implications of our findings for the broader field of representation learning and propose future research directions. All the code and data are released at \url{ this https URL }.
- [1443] arXiv:2403.09810 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: LabelAId: Just-in-time AI Interventions for Improving Human Labeling Quality and Domain Knowledge in Crowdsourcing SystemsChu Li , Zhihan Zhang , Michael Saugstad , Esteban Safranchik , Minchu Kulkarni , Xiaoyu Huang , Shwetak Patel , Vikram Iyer , Tim Althoff , Jon E. FroehlichSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Crowdsourcing platforms have transformed distributed problem-solving, yet quality control remains a persistent challenge. Traditional quality control measures, such as prescreening workers and refining instructions, often focus solely on optimizing economic output. This paper explores just-in-time AI interventions to enhance both labeling quality and domain-specific knowledge among crowdworkers. We introduce LabelAId, an advanced inference model combining Programmatic Weak Supervision (PWS) with FT-Transformers to infer label correctness based on user behavior and domain knowledge. Our technical evaluation shows that our LabelAId pipeline consistently outperforms state-of-the-art ML baselines, improving mistake inference accuracy by 36.7% with 50 downstream samples. We then implemented LabelAId into Project Sidewalk, an open-source crowdsourcing platform for urban accessibility. A between-subjects study with 34 participants demonstrates that LabelAId significantly enhances label precision without compromising efficiency while also increasing labeler confidence. We discuss LabelAId's success factors, limitations, and its generalizability to other crowdsourced science domains.
- [1444] arXiv:2403.09830 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards the Reusability and Compositionality of Causal RepresentationsComments: Accepted to the 3rd Conference on Causal Learning and Reasoning (CLeaR 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Causal Representation Learning (CRL) aims at identifying high-level causal factors and their relationships from high-dimensional observations, e.g., images. While most CRL works focus on learning causal representations in a single environment, in this work we instead propose a first step towards learning causal representations from temporal sequences of images that can be adapted in a new environment, or composed across multiple related environments. In particular, we introduce DECAF, a framework that detects which causal factors can be reused and which need to be adapted from previously learned causal representations. Our approach is based on the availability of intervention targets, that indicate which variables are perturbed at each time step. Experiments on three benchmark datasets show that integrating our framework with four state-of-the-art CRL approaches leads to accurate representations in a new environment with only a few samples.
- [1445] arXiv:2403.09847 (cross-list from physics.space-ph) [ pdf , ps , html , other ]
-
Title: Forecasting Geoffective Events from Solar Wind Data and Evaluating the Most Predictive Features through Machine Learning ApproachesSabrina Guastavino , Katsiaryna Bahamazava , Emma Perracchione , Fabiana Camattari , Gianluca Audone , Daniele Telloni , Roberto Susino , Gianalfredo Nicolini , Silvano Fineschi , Michele Piana , Anna Maria MassoneSubjects: Space Physics (physics.space-ph) ; Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
Abstract: This study addresses the prediction of geomagnetic disturbances by exploiting machine learning techniques. Specifically, the Long-Short Term Memory recurrent neural network, which is particularly suited for application over long time series, is employed in the analysis of in-situ measurements of solar wind plasma and magnetic field acquired over more than one solar cycle, from $2005$ to $2019$, at the Lagrangian point L$1$. The problem is approached as a binary classification aiming to predict one hour in advance a decrease in the SYM-H geomagnetic activity index below the threshold of $-50$ nT, which is generally regarded as indicative of magnetospheric perturbations. The strong class imbalance issue is tackled by using an appropriate loss function tailored to optimize appropriate skill scores in the training phase of the neural network. Beside classical skill scores, value-weighted skill scores are then employed to evaluate predictions, suitable in the study of problems, such as the one faced here, characterized by strong temporal variability. For the first time, the content of magnetic helicity and energy carried by solar transients, associated with their detection and likelihood of geo-effectiveness, were considered as input features of the network architecture. Their predictive capabilities are demonstrated through a correlation-driven feature selection method to rank the most relevant characteristics involved in the neural network prediction model. The optimal performance of the adopted neural network in properly forecasting the onset of geomagnetic storms, which is a crucial point for giving real warnings in an operational setting, is finally showed.
- [1446] arXiv:2403.09849 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Self-Consistency Boosts Calibration for Math ReasoningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. Evaluation on two popular benchmarks (GSM8K and MathQA) using strong open-source LLMs (Mistral and LLaMA2), our methods better bridge model confidence and accuracy than existing methods based on p(True) (Kadavath et al., 2022) or logit (Kadavath et al., 2022).
- [1447] arXiv:2403.09857 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Few-Shot Class Incremental Learning with Attention-Aware Self-Adaptive PromptSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Few-Shot Class-Incremental Learning (FSCIL) models aim to incrementally learn new classes with scarce samples while preserving knowledge of old ones. Existing FSCIL methods usually fine-tune the entire backbone, leading to overfitting and hindering the potential to learn new classes. On the other hand, recent prompt-based CIL approaches alleviate forgetting by training prompts with sufficient data in each task. In this work, we propose a novel framework named Attention-aware Self-adaptive Prompt (ASP). ASP encourages task-invariant prompts to capture shared knowledge by reducing specific information from the attention aspect. Additionally, self-adaptive task-specific prompts in ASP provide specific information and transfer knowledge from old classes to new classes with an Information Bottleneck learning objective. In summary, ASP prevents overfitting on base task and does not require enormous data in few-shot incremental tasks. Extensive experiments on three benchmark datasets validate that ASP consistently outperforms state-of-the-art FSCIL and prompt-based CIL methods in terms of both learning new classes and mitigating forgetting.
- [1448] arXiv:2403.09861 (cross-list from cs.ET) [ pdf , ps , html , other ]
-
Title: NN-Defined Modulator: Reconfigurable and Portable Software Modulator on IoT GatewaysJournal-ref: NSDI 2024Subjects: Emerging Technologies (cs.ET) ; Artificial Intelligence (cs.AI)
Abstract: A physical-layer modulator is a vital component for an IoT gateway to map the symbols to signals. However, due to the soldered hardware chipsets on the gateway's motherboards or the diverse toolkits on different platforms for the software radio, the existing solutions either have limited extensibility or are platform-specific. Such limitation is hard to ignore when modulation schemes and hardware platforms have become extremely diverse. This paper presents a new paradigm of using neural networks as an abstraction layer for physical layer modulators in IoT gateway devices, referred to as NN-defined modulators. Our approach addresses the challenges of extensibility and portability for multiple technologies on various hardware platforms. The proposed NN-defined modulator uses a model-driven methodology rooted in solid mathematical foundations while having native support for hardware acceleration and portability to heterogeneous platforms. We conduct the evaluation of NN-defined modulators on different platforms, including Nvidia Jetson Nano and Raspberry Pi. Evaluations demonstrate that our NN-defined modulator effectively operates as conventional modulators and provides significant efficiency gains (up to $4.7\times$ on Nvidia Jetson Nano and $1.1\times$ on Raspberry Pi), indicating high portability. Furthermore, we show the real-world applications using our NN-defined modulators to generate ZigBee and WiFi packets, which are compliant with commodity TI CC2650 (ZigBee) and Intel AX201 (WiFi NIC), respectively.
- [1449] arXiv:2403.09863 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards White Box Deep LearningComments: 16 pages, 12 figures, independent research, v5 changes: Expanded Abstract and Related Work section; minor wording improvementsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Deep neural networks learn fragile "shortcut" features, rendering them difficult to interpret (black box) and vulnerable to adversarial attacks. This paper proposes semantic features as a general architectural solution to this problem. The main idea is to make features locality-sensitive in the adequate semantic topology of the domain, thus introducing a strong regularization. The proof of concept network is lightweight, inherently interpretable and achieves almost human-level adversarial test metrics - with no adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at this https URL
- [1450] arXiv:2403.09869 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware PriorsComments: Published in Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)Subjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance -- even when only retraining the final layer of a previously trained non-robust model. Group aware-priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts.
- [1451] arXiv:2403.09871 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal ImageComments: 20 pages, 6 pages, 5 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: In this work, we present ThermoHands, a new benchmark for thermal image-based egocentric 3D hand pose estimation, aimed at overcoming challenges like varying lighting and obstructions (e.g., handwear). The benchmark includes a diverse dataset from 28 subjects performing hand-object and hand-virtual interactions, accurately annotated with 3D hand poses through an automated process. We introduce a bespoken baseline method, TheFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TheFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.
- [1452] arXiv:2403.09887 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Sabi\'a-2: A New Generation of Portuguese Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Medium, matches or surpasses GPT-4's performance in 23 out of 64 exams and outperforms GPT-3.5 in 58 out of 64 exams. Notably, specialization has a significant impact on a model's performance without the need to increase its size, allowing us to offer Sabiá-2 Medium at a price per token that is 10 times cheaper than GPT-4. Finally, we identified that math and coding are key abilities that need improvement.
- [1453] arXiv:2403.09891 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Fisher Mask Nodes for Language Model MergingComments: Accepted at LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Fine-tuning pre-trained models provides significant advantages in downstream performance. The ubiquitous nature of pre-trained models such as BERT and its derivatives in natural language processing has also led to a proliferation of task-specific fine-tuned models. As these models typically only perform one task well, additional training or ensembling is required in multi-task scenarios. The growing field of model merging provides a solution, dealing with the challenge of combining multiple task-specific models into a single multi-task model. In this study, we introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Utilizing the Fisher information of mask nodes within the Transformer architecture, we devise a computationally efficient weighted-averaging scheme. Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging in a fraction of the computational cost, with baseline performance improvements of up to +6.5 and a speedup between 57.4x and 321.7x across models. Our results prove the potential of our method in current multi-task learning environments and suggest its scalability and adaptability to new model architectures and learning scenarios.
- [1454] arXiv:2403.09904 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server, while being respectful of privacy. A critical bottleneck in FL is the communication cost. A pivotal strategy to mitigate this burden is \emph{Local Training}, which involves running multiple local stochastic gradient descent iterations between communication phases. Our work is inspired by the innovative \emph{Scaffnew} algorithm, which has considerably advanced the reduction of communication complexity in FL. We introduce FedComLoc (Federated Compressed and Local Training), integrating practical and effective compression into \emph{Scaffnew} to further enhance communication efficiency. Extensive experiments, using the popular TopK compressor and quantization, demonstrate its prowess in substantially reducing communication overheads in heterogeneous settings.
- [1455] arXiv:2403.09920 (cross-list from eess.IV) [ pdf , ps , other ]
-
Title: Predicting Generalization of AI Colonoscopy Models to Unseen DataJoel Shor , Carson McNeil , Yotam Intrator , Joseph R Ledsam , Hiro-o Yamano , Daisuke Tsurumaru , Hiroki Kayama , Atsushi Hamabe , Koji Ando , Mitsuhiko Ota , Haruei Ogino , Hiroshi Nakase , Kaho Kobayashi , Masaaki Miyo , Eiji Oki , Ichiro Takemasa , Ehud Rivlin , Roman GoldenbergSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Abstract: $\textbf{Background}$: Generalizability of AI colonoscopy algorithms is important for wider adoption in clinical practice. However, current techniques for evaluating performance on unseen data require expensive and time-intensive labels.
$\textbf{Methods}$: We use a "Masked Siamese Network" (MSN) to identify novel phenomena in unseen data and predict polyp detector performance. MSN is trained to predict masked out regions of polyp images, without any labels. We test MSN's ability to be trained on data only from Israel and detect unseen techniques, narrow-band imaging (NBI) and chromendoscoy (CE), on colonoscopes from Japan (354 videos, 128 hours). We also test MSN's ability to predict performance of Computer Aided Detection (CADe) of polyps on colonoscopies from both countries, even though MSN is not trained on data from Japan.
$\textbf{Results}$: MSN correctly identifies NBI and CE as less similar to Israel whitelight than Japan whitelight (bootstrapped z-test, |z| > 496, p < 10^-8 for both) using the label-free Frechet distance. MSN detects NBI with 99% accuracy, predicts CE better than our heuristic (90% vs 79% accuracy) despite being trained only on whitelight, and is the only method that is robust to noisy labels. MSN predicts CADe polyp detector performance on in-domain Israel and out-of-domain Japan colonoscopies (r=0.79, 0.37 respectively). With few examples of Japan detector performance to train on, MSN prediction of Japan performance improves (r=0.56).
$\textbf{Conclusion}$: Our technique can identify distribution shifts in clinical data and can predict CADe detector performance on unseen data, without labels. Our self-supervised approach can aid in detecting when data in practice is different from training, such as between hospitals or data has meaningfully shifted from training. MSN has potential for application to medical image domains beyond colonoscopy. - [1456] arXiv:2403.09930 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features CriticsComments: The first two authors contributed equally to this workSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A key aspect of intelligence is the ability to demonstrate a broad spectrum of behaviors for adapting to unexpected situations. Over the past decade, advancements in deep reinforcement learning have led to groundbreaking achievements to solve complex continuous control tasks. However, most approaches return only one solution specialized for a specific problem. We introduce Quality-Diversity Actor-Critic (QDAC), an off-policy actor-critic deep reinforcement learning algorithm that leverages a value function critic and a successor features critic to learn high-performing and diverse behaviors. In this framework, the actor optimizes an objective that seamlessly unifies both critics using constrained optimization to (1) maximize return, while (2) executing diverse skills. Compared with other Quality-Diversity methods, QDAC achieves significantly higher performance and more diverse behaviors on six challenging continuous control locomotion tasks. We also demonstrate that we can harness the learned skills to adapt better than other baselines to five perturbed environments. Finally, qualitative analyses showcase a range of remarkable behaviors, available at: this http URL .
- [1457] arXiv:2403.09940 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Global Convergence Guarantees for Federated Policy Gradient Methods with AdversariesComments: 27 pages, 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: Federated Reinforcement Learning (FRL) allows multiple agents to collaboratively build a decision making policy without sharing raw trajectories. However, if a small fraction of these agents are adversarial, it can lead to catastrophic results. We propose a policy gradient based approach that is robust to adversarial agents which can send arbitrary values to the server. Under this setting, our results form the first global convergence guarantees with general parametrization. These results demonstrate resilience with adversaries, while achieving sample complexity of order $\tilde{\mathcal{O}}\left( \frac{1}{\epsilon^2} \left( \frac{1}{N-f} + \frac{f^2}{(N-f)^2}\right)\right)$, where $N$ is the total number of agents and $f$ is the number of adversarial agents.
- [1458] arXiv:2403.09948 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-trainingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The integration of artificial intelligence (AI) with radiology has marked a transformative era in medical diagnostics. Vision foundation models have been adopted to enhance radiologic imaging analysis. However, the distinct complexities of radiological imaging, including the interpretation of 2D and 3D radiological data, pose unique challenges that existing models, trained on general non-medical images, fail to address adequately. To bridge this gap and capitalize on the diagnostic precision required in medical imaging, we introduce RadCLIP: a pioneering cross-modal foundational model that harnesses Contrastive Language-Image Pre-training (CLIP) to refine radiologic image analysis. RadCLIP incorporates a novel 3D slice pooling mechanism tailored for volumetric image analysis and is trained using a comprehensive and diverse dataset of radiologic image-text pairs. Our evaluations demonstrate that RadCLIP effectively aligns radiological images with their corresponding textual annotations, and in the meantime, offers a robust vision backbone for radiologic imagery with significant promise.
- [1459] arXiv:2403.09963 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge ExtractionComments: Accepted by COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction, i.e., prompts tend to introduce biases toward specific labels. Prompt bias presents a significant challenge in assessing the factual knowledge within PLMs. Therefore, this paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias. We show that: 1) all prompts in the experiments exhibit non-negligible bias, with gradient-based prompts like AutoPrompt and OptiPrompt displaying significantly higher levels of bias; 2) prompt bias can amplify benchmark accuracy unreasonably by overfitting the test datasets, especially on imbalanced datasets like LAMA. Based on these findings, we propose a representation-based approach to mitigate the prompt bias during inference time. Specifically, we first estimate the biased representation using prompt-only querying, and then remove it from the model's internal representations to generate the debiased representations, which are used to produce the final debiased outputs. Experiments across various prompts, PLMs, and benchmarks show that our approach can not only correct the overfitted performance caused by prompt bias, but also significantly improve the prompt retrieval capability (up to 10% absolute performance gain). These results indicate that our approach effectively alleviates prompt bias in knowledge evaluation, thereby enhancing the reliability of benchmark assessments. Hopefully, our plug-and-play approach can be a golden standard to strengthen PLMs toward reliable knowledge bases. Code and data are released in this https URL .
- [1460] arXiv:2403.09974 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category DiscoverySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes, leveraging the class concepts learned from labeled samples. Current GCD methods only use a single visual modality of information, resulting in poor classification of visually similar classes. Though certain classes are visually confused, their text information might be distinct, motivating us to introduce text information into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information. To tackle this challenging problem, in this paper, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of the CLIP's text encoder to generate pseudo text embeddings. Besides, we employ a dual-branch framework, through the joint learning and instance consistency of different modality branches, visual and semantic information mutually enhance each other, promoting the interaction and fusion of visual and text embedding space. Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks, achieving new state-of-the-art. The code will be released at \url{ this https URL }.
- [1461] arXiv:2403.09977 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EfficientVMamba: Atrous Selective Scan for Light Weight Visual MambaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution while Transformers offer global reach but escalate computational demands $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduce a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates a atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevate the model performance. Experimental results show that, EfficientVMamba scales down the computational complexity while yields competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with $1.3$G FLOPs improves Vim-Ti with $1.5$G FLOPs by a large margin of $5.6\%$ accuracy on ImageNet. Code is available at: \url{ this https URL }.
- [1462] arXiv:2403.09998 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: FBPT: A Fully Binary Point TransformerComments: Accepted to ICRA 2024. arXiv admin note: substantial text overlap with arXiv:2303.01166Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a novel Fully Binary Point Cloud Transformer (FBPT) model which has the potential to be widely applied and expanded in the fields of robotics and mobile devices. By compressing the weights and activations of a 32-bit full-precision network to 1-bit binary values, the proposed binary point cloud Transformer network significantly reduces the storage footprint and computational resource requirements of neural network models for point cloud processing tasks, compared to full-precision point cloud networks. However, achieving a fully binary point cloud Transformer network, where all parts except the modules specific to the task are binary, poses challenges and bottlenecks in quantizing the activations of Q, K, V and self-attention in the attention module, as they do not adhere to simple probability distributions and can vary with input data. Furthermore, in our network, the binary attention module undergoes a degradation of the self-attention module due to the uniform distribution that occurs after the softmax operation. The primary focus of this paper is on addressing the performance degradation issue caused by the use of binary point cloud Transformer modules. We propose a novel binarization mechanism called dynamic-static hybridization. Specifically, our approach combines static binarization of the overall network model with fine granularity dynamic binarization of data-sensitive components. Furthermore, we make use of a novel hierarchical training scheme to obtain the optimal model and binarization parameters. These above improvements allow the proposed binarization method to outperform binarization methods applied to convolution neural networks when used in point cloud Transformer structures. To demonstrate the superiority of our algorithm, we conducted experiments on two different tasks: point cloud classification and place recognition.
- [1463] arXiv:2403.10014 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: NNCTC: Physical Layer Cross-Technology Communication via Neural NetworksComments: 12 pagesSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: Cross-technology communication(CTC) enables seamless interactions between diverse wireless technologies. Most existing work is based on reversing the transmission path to identify the appropriate payload to generate the waveform that the target devices can recognize. However, this method suffers from many limitations, including dependency on specific technologies and the necessity for intricate algorithms to mitigate distortion. In this work, we present NNCTC, a Neural-Network-based Cross-Technology Communication framework inspired by the adaptability of trainable neural models in wireless communications. By converting signal processing components within the CTC pipeline into neural models, the NNCTC is designed for end-to-end training without requiring labeled data. This enables the NNCTC system to autonomously derive the optimal CTC payload, which significantly eases the development complexity and showcases the scalability potential for various CTC links. Particularly, we construct a CTC system from Wi-Fi to ZigBee. The NNCTC system outperforms the well-recognized WEBee and WIDE design in error performance, achieving an average packet reception rate(PRR) of 92.3% and an average symbol error rate(SER) as low as 1.3%.
- [1464] arXiv:2403.10024 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument LeakageSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Abstract: This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling are proposed. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.
- [1465] arXiv:2403.10039 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Rethinking Low-quality Optical Flow in Unsupervised Surgical Instrument SegmentationPeiran Wu , Yang Liu , Jiayu Huo , Gongyu Zhang , Christos Bergeles , Rachel Sparks , Prokar Dasgupta , Alejandro Granados , Sebastien OurselinSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Video-based surgical instrument segmentation plays an important role in robot-assisted surgeries. Unlike supervised settings, unsupervised segmentation relies heavily on motion cues, which are challenging to discern due to the typically lower quality of optical flow in surgical footage compared to natural scenes. This presents a considerable burden for the advancement of unsupervised segmentation techniques. In our work, we address the challenge of enhancing model performance despite the inherent limitations of low-quality optical flow. Our methodology employs a three-pronged approach: extracting boundaries directly from the optical flow, selectively discarding frames with inferior flow quality, and employing a fine-tuning process with variable frame rates. We thoroughly evaluate our strategy on the EndoVis2017 VOS dataset and Endovis2017 Challenge dataset, where our model demonstrates promising results, achieving a mean Intersection-over-Union (mIoU) of 0.75 and 0.72, respectively. Our findings suggest that our approach can greatly decrease the need for manual annotations in clinical environments and may facilitate the annotation process for new datasets. The code is available at this https URL
- [1466] arXiv:2403.10041 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Towards Embedding Dynamic Personas in Interactive Robots: Masquerading Animated Social Kinematics (MASK)Jeongeun Park , Taemoon Jeong , Hyeonseong Kim , Taehyun Byun , Seungyoon Shin , Keunjun Choi , Jaewoon Kwon , Taeyoon Lee , Matthew Pan , Sungjoon ChoiComments: 4 pages, 3 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents the design and development of an innovative interactive robotic system to enhance audience engagement using character-like personas. Built upon the foundations of persona-driven dialog agents, this work extends the agent application to the physical realm, employing robots to provide a more immersive and interactive experience. The proposed system, named the Masquerading Animated Social Kinematics (MASK), leverages an anthropomorphic robot which interacts with guests using non-verbal interactions, including facial expressions and gestures. A behavior generation system based upon a finite-state machine structure effectively conditions robotic behavior to convey distinct personas. The MASK framework integrates a perception engine, a behavior selection engine, and a comprehensive action library to enable real-time, dynamic interactions with minimal human intervention in behavior design. Throughout the user subject studies, we examined whether the users could recognize the intended character in film-character-based persona conditions. We conclude by discussing the role of personas in interactive agents and the factors to consider for creating an engaging user experience.
- [1467] arXiv:2403.10049 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: PPM : A Pre-trained Plug-in Model for Click-through Rate PredictionComments: Accepted by ACM Web Conference 2024 (WWW'24)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Click-through rate (CTR) prediction is a core task in recommender systems. Existing methods (IDRec for short) rely on unique identities to represent distinct users and items that have prevailed for decades. On one hand, IDRec often faces significant performance degradation on cold-start problem; on the other hand, IDRec cannot use longer training data due to constraints imposed by iteration efficiency. Most prior studies alleviate the above problems by introducing pre-trained knowledge(e.g. pre-trained user model or multi-modal embeddings). However, the explosive growth of online latency can be attributed to the huge parameters in the pre-trained model. Therefore, most of them cannot employ the unified model of end-to-end training with IDRec in industrial recommender systems, thus limiting the potential of the pre-trained model. To this end, we propose a $\textbf{P}$re-trained $\textbf{P}$lug-in CTR $\textbf{M}$odel, namely PPM. PPM employs multi-modal features as input and utilizes large-scale data for pre-training. Then, PPM is plugged in IDRec model to enhance unified model's performance and iteration efficiency. Upon incorporating IDRec model, certain intermediate results within the network are cached, with only a subset of the parameters participating in training and serving. Hence, our approach can successfully deploy an end-to-end model without causing huge latency increases. Comprehensive offline experiments and online A/B testing at JD E-commerce demonstrate the efficiency and effectiveness of PPM.
- [1468] arXiv:2403.10056 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Don't Half-listen: Capturing Key-part Information in Continual Instruction TuningYongquan He , Xuancheng Huang , Minghao Tang , Lingxun Meng , Xiang Li , Wei Lin , Wenyuan Zhang , Yifu GaoComments: 18 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
- [1469] arXiv:2403.10063 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Unified Projection-Free Algorithms for Adversarial DR-Submodular OptimizationComments: This paper is published in ICLR 2024. This version includes a correction for regret bounds in the full-information zeroth order feedback setting (see the footnote on page 1 for details)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Optimization and Control (math.OC)
Abstract: This paper introduces unified projection-free Frank-Wolfe type algorithms for adversarial continuous DR-submodular optimization, spanning scenarios such as full information and (semi-)bandit feedback, monotone and non-monotone functions, different constraints, and types of stochastic queries. For every problem considered in the non-monotone setting, the proposed algorithms are either the first with proven sub-linear $\alpha$-regret bounds or have better $\alpha$-regret bounds than the state of the art, where $\alpha$ is a corresponding approximation bound in the offline setting. In the monotone setting, the proposed approach gives state-of-the-art sub-linear $\alpha$-regret bounds among projection-free algorithms in 7 of the 8 considered cases while matching the result of the remaining case. Additionally, this paper addresses semi-bandit and bandit feedback for adversarial DR-submodular optimization, advancing the understanding of this optimization area.
- [1470] arXiv:2403.10069 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Boundary Matters: A Bi-Level Active Finetuning FrameworkSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The pretraining-finetuning paradigm has gained widespread adoption in vision tasks and other fields, yet it faces the significant challenge of high sample annotation costs. To mitigate this, the concept of active finetuning has emerged, aiming to select the most appropriate samples for model finetuning within a limited budget. Traditional active learning methods often struggle in this setting due to their inherent bias in batch selection. Furthermore, the recent active finetuning approach has primarily concentrated on aligning the distribution of selected subsets with the overall data pool, focusing solely on diversity. In this paper, we propose a Bi-Level Active Finetuning framework to select the samples for annotation in one shot, which includes two stages: core sample selection for diversity, and boundary sample selection for uncertainty. The process begins with the identification of pseudo-class centers, followed by an innovative denoising method and an iterative strategy for boundary sample selection in the high-dimensional feature space, all without relying on ground-truth labels. Our comprehensive experiments provide both qualitative and quantitative evidence of our method's efficacy, outperforming all the existing baselines.
- [1471] arXiv:2403.10079 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning Physical Dynamics for Object-centric Visual PredictionComments: 13 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which suffers from heavy computational costs while lacking a deep understanding of the physical dynamics behind videos. Recently, object-centric prediction methods have emerged and attracted increasing interest. Inspired by it, this paper proposes an unsupervised object-centric prediction model that makes future predictions by learning visual dynamics between objects. Our model consists of two modules, perceptual, and dynamic module. The perceptual module is utilized to decompose images into several objects and synthesize images with a set of object-centric representations. The dynamic module fuses contextual information, takes environment-object and object-object interaction into account, and predicts the future trajectory of objects. Extensive experiments are conducted to validate the effectiveness of the proposed method. Both quantitative and qualitative experimental results demonstrate that our model generates higher visual quality and more physically reliable predictions compared to the state-of-the-art methods.
- [1472] arXiv:2403.10086 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Large Language Models to Generate System-Level Test Programs Targeting Non-functional PropertiesDenis Schwachhofer , Peter Domanski , Steffen Becker , Stefan Wagner , Matthias Sauer , Dirk Pflüger , Ilia PolianComments: Testmethoden und Zuverlässigkeit von Schaltungen und Systemen, TuZ 2024Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Programming Languages (cs.PL)
Abstract: System-Level Test (SLT) has been a part of the test flow for integrated circuits for over a decade and still gains importance. However, no systematic approaches exist for test program generation, especially targeting non-functional properties of the Device under Test (DUT). Currently, test engineers manually compose test suites from off-the-shelf software, approximating the end-user environment of the DUT. This is a challenging and tedious task that does not guarantee sufficient control over non-functional properties. This paper proposes Large Language Models (LLMs) to generate test programs. We take a first glance at how pre-trained LLMs perform in test program generation to optimize non-functional properties of the DUT. Therefore, we write a prompt to generate C code snippets that maximize the instructions per cycle of a super-scalar, out-of-order architecture in simulation. Additionally, we apply prompt and hyperparameter optimization to achieve the best possible results without further training.
- [1473] arXiv:2403.10088 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with RLAIFAmey Hengle , Aswini Kumar , Sahajpreet Singh , Anil Bandhakavi , Md Shad Akhtar , Tanmoy ChakrobortySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Counterspeech, defined as a response to mitigate online hate speech, is increasingly used as a non-censorial solution. Addressing hate speech effectively involves dispelling the stereotypes, prejudices, and biases often subtly implied in brief, single-sentence statements or abuses. These implicit expressions challenge language models, especially in seq2seq tasks, as model performance typically excels with longer contexts. Our study introduces CoARL, a novel framework enhancing counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. CoARL's first two phases involve sequential multi-instruction tuning, teaching the model to understand intents, reactions, and harms of offensive statements, and then learning task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and non-toxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, showing an average improvement of 3 points in intent-conformity and 4 points in argument-quality metrics. Extensive human evaluation supports CoARL's efficacy in generating superior and more context-appropriate responses compared to existing systems, including prominent LLMs like ChatGPT.
- [1474] arXiv:2403.10097 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Adaptive Random Feature Regularization on Fine-tuning Deep Neural NetworksComments: Accepted to CVPR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: While fine-tuning is a de facto standard method for training deep neural networks, it still suffers from overfitting when using small target datasets. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or introducing regularization terms such as contrastive loss. However, these methods require auxiliary source information (e.g., source labels or datasets) or heavy additional computations. In this paper, we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models to adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end, AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class conditional Gaussian distributions. Furthermore, AdaRand dynamically updates the conditional distribution to follow the currently updated feature extractors and balance the distance between classes in feature spaces. Our experiments show that AdaRand outperforms the other fine-tuning regularization, which requires auxiliary source information and heavy computation costs.
- [1475] arXiv:2403.10105 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Belief Aided Navigation using Bayesian Reinforcement Learning for Avoiding Humans in Blind SpotsComments: 8 pages, 4 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent research on mobile robot navigation has focused on socially aware navigation in crowded environments. However, existing methods do not adequately account for human robot interactions and demand accurate location information from omnidirectional sensors, rendering them unsuitable for practical applications. In response to this need, this study introduces a novel algorithm, BNBRL+, predicated on the partially observable Markov decision process framework to assess risks in unobservable areas and formulate movement strategies under uncertainty. BNBRL+ consolidates belief algorithms with Bayesian neural networks to probabilistically infer beliefs based on the positional data of humans. It further integrates the dynamics between the robot, humans, and inferred beliefs to determine the navigation paths and embeds social norms within the reward function, thereby facilitating socially aware navigation. Through experiments in various risk laden scenarios, this study validates the effectiveness of BNBRL+ in navigating crowded environments with blind spots. The model's ability to navigate effectively in spaces with limited visibility and avoid obstacles dynamically can significantly improve the safety and reliability of autonomous vehicles.
- [1476] arXiv:2403.10107 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated ReasoningComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems, in which Video-based Human-Object Interaction (V-HOI) detection is a crucial task in semantic scene understanding, aimed at comprehensively understanding HOI relationships within a video to benefit the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in accurate detection on specific datasets, they still lack the general reasoning ability like human beings to effectively induce HOI relationships. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that could facilitate the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme to leverage the LLM conduct reasoning from different aspects. In the second stage, we perform Multi-LLMs Debate to get the final reasoning answer based on the different knowledge in different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model to enhance the base V-HOI models' discriminative ability to better cooperate with LLMs. We validate the superiority of our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.
- [1477] arXiv:2403.10110 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Meta Operator for Complex Query Answering on Knowledge GraphsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Abstract: Knowledge graphs contain informative factual knowledge but are considered incomplete. To answer complex queries under incomplete knowledge, learning-based Complex Query Answering (CQA) models are proposed to directly learn from the query-answer samples to avoid the direct traversal of incomplete graph data. Existing works formulate the training of complex query answering models as multi-task learning and require a large number of training samples. In this work, we explore the compositional structure of complex queries and argue that the different logical operator types, rather than the different complex query types, are the key to improving generalizability. Accordingly, we propose a meta-learning algorithm to learn the meta-operators with limited data and adapt them to different instances of operators under various complex queries. Empirical results show that learning meta-operators is more effective than learning original CQA or meta-CQA models.
- [1478] arXiv:2403.10131 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RAFT: Adapting Language Model to Domain Specific RAGTianjun Zhang , Shishir G. Patil , Naman Jain , Sheng Shen , Matei Zaharia , Ion Stoica , Joseph E. GonzalezSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in a "open-book" in-domain settings. In RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call, distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This coupled with RAFT's chain-of-thought-style response helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG. RAFT's code and demo are open-sourced at this http URL .
- [1479] arXiv:2403.10135 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: The Whole is Better than the Sum: Using Aggregated Demonstrations in In-Context Learning for Sequential RecommendationComments: NAACL 2024 (Findings)Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Large language models (LLMs) have shown excellent performance on various NLP tasks. To use LLMs as strong sequential recommenders, we explore the in-context learning approach to sequential recommendation. We investigate the effects of instruction format, task consistency, demonstration selection, and number of demonstrations. As increasing the number of demonstrations in ICL does not improve accuracy despite using a long prompt, we propose a novel method called LLMSRec-Syn that incorporates multiple demonstration users into one aggregated demonstration. Our experiments on three recommendation datasets show that LLMSRec-Syn outperforms state-of-the-art LLM-based sequential recommendation methods. In some cases, LLMSRec-Syn can perform on par with or even better than supervised learning methods. Our code is publicly available at this https URL .
- [1480] arXiv:2403.10136 (cross-list from stat.ME) [ pdf , ps , other ]
-
Title: Response Style Characterization for Repeated Measures Using the Visual Analogue ScaleShunsuke Minusa , Tadayuki Matsumura , Kanako Esaki , Yang Shao , Chihiro Yoshimura , Hiroyuki MizunoComments: 13 pages, 7 figures, submitted to IEEE AccessSubjects: Methodology (stat.ME) ; Artificial Intelligence (cs.AI)
Abstract: Self-report measures (e.g., Likert scales) are widely used to evaluate subjective health perceptions. Recently, the visual analog scale (VAS), a slider-based scale, has become popular owing to its ability to precisely and easily assess how people feel. These data can be influenced by the response style (RS), a user-dependent systematic tendency that occurs regardless of questionnaire instructions. Despite its importance, especially in between-individual analysis, little attention has been paid to handling the RS in the VAS (denoted as response profile (RP)), as it is mainly used for within-individual monitoring and is less affected by RP. However, VAS measurements often require repeated self-reports of the same questionnaire items, making it difficult to apply conventional methods on a Likert scale. In this study, we developed a novel RP characterization method for various types of repeatedly measured VAS data. This approach involves the modeling of RP as distributional parameters ${\theta}$ through a mixture of RS-like distributions, and addressing the issue of unbalanced data through bootstrap sampling for treating repeated measures. We assessed the effectiveness of the proposed method using simulated pseudo-data and an actual dataset from an empirical study. The assessment of parameter recovery showed that our method accurately estimated the RP parameter ${\theta}$, demonstrating its robustness. Moreover, applying our method to an actual VAS dataset revealed the presence of individual RP heterogeneity, even in repeated VAS measurements, similar to the findings of the Likert scale. Our proposed method enables RP heterogeneity-aware VAS data analysis, similar to Likert-scale data analysis.
- [1481] arXiv:2403.10144 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: NLP Verification: Towards a General Methodology for Certifying RobustnessMarco Casadio , Tanvi Dinkar , Ekaterina Komendantskaya , Luca Arnaboldi , Omri Isac , Matthew L. Daggitt , Guy Katz , Verena Rieser , Oliver LemonSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Abstract: Deep neural networks have exhibited substantial success in the field of Natural Language Processing (NLP) and ensuring their safety and reliability is crucial: there are safety critical contexts where such models must be robust to variability or attack, and give guarantees over their output. Unlike Computer Vision, NLP lacks a unified verification methodology and, despite recent advancements in literature, they are often light on the pragmatical issues of NLP verification. In this paper, we make an attempt to distil and evaluate general components of an NLP verification pipeline, that emerges from the progress in the field to date. Our contributions are two-fold. Firstly, we give a general characterisation of verifiable subspaces that result from embedding sentences into continuous spaces. We identify, and give an effective method to deal with, the technical challenge of semantic generalisability of verified subspaces; and propose it as a standard metric in the NLP verification pipelines (alongside with the standard metrics of model accuracy and model verifiability). Secondly, we propose a general methodology to analyse the effect of the embedding gap, a problem that refers to the discrepancy between verification of geometric subpspaces on the one hand, and semantic meaning of sentences which the geometric subspaces are supposed to represent, on the other hand. In extreme cases, poor choices in embedding of sentences may invalidate verification results. We propose a number of practical NLP methods that can help to identify the effects of the embedding gap; and in particular we propose the metric of falsifiability of semantic subpspaces as another fundamental metric to be reported as part of the NLP verification pipeline. We believe that together these general principles pave the way towards a more consolidated and effective development of this new domain.
- [1482] arXiv:2403.10158 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Functional Graph Convolutional Networks: A unified multi-task and multi-modal learning framework to facilitate health and social-care insightsTobia Boschi , Francesca Bonin , Rodrigo Ordonez-Hurtado , Cécile Rousseau , Alessandra Pascale , John DinsmoreSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces a novel Functional Graph Convolutional Network (funGCN) framework that combines Functional Data Analysis and Graph Convolutional Networks to address the complexities of multi-task and multi-modal learning in digital health and longitudinal studies. With the growing importance of health solutions to improve health care and social support, ensure healthy lives, and promote well-being at all ages, funGCN offers a unified approach to handle multivariate longitudinal data for multiple entities and ensures interpretability even with small sample sizes. Key innovations include task-specific embedding components that manage different data types, the ability to perform classification, regression, and forecasting, and the creation of a knowledge graph for insightful data interpretation. The efficacy of funGCN is validated through simulation experiments and a real-data application.
- [1483] arXiv:2403.10164 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CoReEcho: Continuous Representation Learning for 2D+time Echocardiography AnalysisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep learning (DL) models have been advancing automatic medical image analysis on various modalities, including echocardiography, by offering a comprehensive end-to-end training pipeline. This approach enables DL models to regress ejection fraction (EF) directly from 2D+time echocardiograms, resulting in superior performance. However, the end-to-end training pipeline makes the learned representations less explainable. The representations may also fail to capture the continuous relation among echocardiogram clips, indicating the existence of spurious correlations, which can negatively affect the generalization. To mitigate this issue, we propose CoReEcho, a novel training framework emphasizing continuous representations tailored for direct EF regression. Our extensive experiments demonstrate that CoReEcho: 1) outperforms the current state-of-the-art (SOTA) on the largest echocardiography dataset (EchoNet-Dynamic) with MAE of 3.90 & R2 of 82.44, and 2) provides robust and generalizable features that transfer more effectively in related downstream tasks. The code is publicly available at this https URL .
- [1484] arXiv:2403.10173 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Hybrid SNN-ANN Network for Event-based Object Detection with Spatial and Temporal AttentionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Event cameras offer high temporal resolution and dynamic range with minimal motion blur, making them promising for object detection tasks. While Spiking Neural Networks (SNNs) are a natural match for event-based sensory data and enable ultra-energy efficient and low latency inference on neuromorphic hardware, Artificial Neural Networks (ANNs) tend to display more stable training dynamics and faster convergence resulting in greater task performance. Hybrid SNN-ANN approaches are a promising alternative, enabling to leverage the strengths of both SNN and ANN architectures. In this work, we introduce the first Hybrid Attention-based SNN-ANN backbone for object detection using event cameras. We propose a novel Attention-based SNN-ANN bridge module to capture sparse spatial and temporal relations from the SNN layer and convert them into dense feature maps for the ANN part of the backbone. Experimental results demonstrate that our proposed method surpasses baseline hybrid and SNN-based approaches by significant margins, with results comparable to existing ANN-based methods. Extensive ablation studies confirm the effectiveness of our proposed modules and architectural choices. These results pave the way toward a hybrid SNN-ANN architecture that achieves ANN like performance at a drastically reduced parameter budget. We implemented the SNN blocks on digital neuromorphic hardware to investigate latency and power consumption and demonstrate the feasibility of our approach.
- [1485] arXiv:2403.10175 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Short Survey on Importance Weighting for Machine LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Importance weighting is a fundamental procedure in statistics and machine learning that weights the objective function or probability distribution based on the importance of the instance in some sense. The simplicity and usefulness of the idea has led to many applications of importance weighting. For example, it is known that supervised learning under an assumption about the difference between the training and test distributions, called distribution shift, can guarantee statistically desirable properties through importance weighting by their density ratio. This survey summarizes the broad applications of importance weighting in machine learning and related research.
- [1486] arXiv:2403.10187 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Grasp Anything: Combining Teacher-Augmented Policy Gradient Learning with Instance Segmentation to Grasp Arbitrary ObjectsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Interactive grasping from clutter, akin to human dexterity, is one of the longest-standing problems in robot learning. Challenges stem from the intricacies of visual perception, the demand for precise motor skills, and the complex interplay between the two. In this work, we present Teacher-Augmented Policy Gradient (TAPG), a novel two-stage learning framework that synergizes reinforcement learning and policy distillation. After training a teacher policy to master the motor control based on object pose information, TAPG facilitates guided, yet adaptive, learning of a sensorimotor policy, based on object segmentation. We zero-shot transfer from simulation to a real robot by using Segment Anything Model for promptable object segmentation. Our trained policies adeptly grasp a wide variety of objects from cluttered scenarios in simulation and the real world based on human-understandable prompts. Furthermore, we show robust zero-shot transfer to novel objects. Videos of our experiments are available at \url{ this https URL }.
- [1487] arXiv:2403.10190 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Perceptual Quality-based Model Training under Annotator Label UncertaintySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Annotators exhibit disagreement during data labeling, which can be termed as annotator label uncertainty. Annotator label uncertainty manifests in variations of labeling quality. Training with a single low-quality annotation per sample induces model reliability degradations. In this work, we first examine the effects of annotator label uncertainty in terms of the model's generalizability and prediction uncertainty. We observe that the model's generalizability and prediction uncertainty degrade with the presence of low-quality noisy labels. Meanwhile, our evaluation of existing uncertainty estimation algorithms indicates their incapability in response to annotator label uncertainty. To mitigate performance degradation, prior methods show that training models with labels collected from multiple independent annotators can enhance generalizability. However, they require massive annotations. Hence, we introduce a novel perceptual quality-based model training framework to objectively generate multiple labels for model training to enhance reliability, while avoiding massive annotations. Specifically, we first select a subset of samples with low perceptual quality scores ranked by statistical regularities of visual signals. We then assign de-aggregated labels to each sample in this subset to obtain a training set with multiple labels. Our experiments and analysis demonstrate that training with the proposed framework alleviates the degradation of generalizability and prediction uncertainty caused by annotator label uncertainty.
- [1488] arXiv:2403.10202 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Learning on JPEG-LDPC Compressed Images: Classifying with SyndromesComments: 5 pages, 3 figures, conference paper, submitted to the EUSIPCO 2024 ConferenceSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract: In goal-oriented communications, the objective of the receiver is often to apply a Deep-Learning model, rather than reconstructing the original data. In this context, direct learning over compressed data, without any prior decoding, holds promise for enhancing the time-efficient execution of inference models at the receiver. However, conventional entropic-coding methods like Huffman and Arithmetic break data structure, rendering them unsuitable for learning without decoding. In this paper, we propose an alternative approach in which entropic coding is realized with Low-Density Parity Check (LDPC) codes. We hypothesize that Deep Learning models can more effectively exploit the internal code structure of LDPC codes. At the receiver, we leverage a specific class of Recurrent Neural Networks (RNNs), specifically Gated Recurrent Unit (GRU), trained for image classification. Our numerical results indicate that classification based on LDPC-coded bit-planes surpasses Huffman and Arithmetic coding, while necessitating a significantly smaller learning model. This demonstrates the efficiency of classification directly from LDPC-coded data, eliminating the need for any form of decompression, even partial, prior to applying the learning model.
- [1489] arXiv:2403.10205 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Read between the lines -- Functionality Extraction From READMEsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: While text summarization is a well-known NLP task, in this paper, we introduce a novel and useful variant of it called functionality extraction from Git README files. Though this task is a text2text generation at an abstract level, it involves its own peculiarities and challenges making existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring, code summarization, etc. We also release a human-annotated dataset called FuncRead, and develop a battery of models for the task. Our exhaustive experimentation shows that small size fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7 Billion CodeLlama model exhibit 70% and 20% gain on the F1 score against ChatGPT and Bard respectively.
- [1490] arXiv:2403.10216 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring Optical Flow Inclusion into nnU-Net Framework for Surgical Instrument SegmentationMarcos Fernández-Rodríguez , Bruno Silva , Sandro Queirós , Helena R. Torres , Bruno Oliveira , Pedro Morais , Lukas R. Buschle , Jorge Correia-Pinto , Estevão Lima , João L. VilaçaJournal-ref: Proceedings Volume 12928, Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling; 1292827 (2024)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Surgical instrument segmentation in laparoscopy is essential for computer-assisted surgical systems. Despite the Deep Learning progress in recent years, the dynamic setting of laparoscopic surgery still presents challenges for precise segmentation. The nnU-Net framework excelled in semantic segmentation analyzing single frames without temporal information. The framework's ease of use, including its ability to be automatically configured, and its low expertise requirements, have made it a popular base framework for comparisons. Optical flow (OF) is a tool commonly used in video tasks to estimate motion and represent it in a single frame, containing temporal information. This work seeks to employ OF maps as an additional input to the nnU-Net architecture to improve its performance in the surgical instrument segmentation task, taking advantage of the fact that instruments are the main moving objects in the surgical field. With this new input, the temporal component would be indirectly added without modifying the architecture. Using CholecSeg8k dataset, three different representations of movement were estimated and used as new inputs, comparing them with a baseline model. Results showed that the use of OF maps improves the detection of classes with high movement, even when these are scarce in the dataset. To further improve performance, future work may focus on implementing other OF-preserving augmentations.
- [1491] arXiv:2403.10220 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: From Chaos to Clarity: Time Series Anomaly Detection in Astronomical ObservationsComments: accepted by ICDE 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: With the development of astronomical facilities, large-scale time series data observed by these facilities is being collected. Analyzing anomalies in these astronomical observations is crucial for uncovering potential celestial events and physical phenomena, thus advancing the scientific research process. However, existing time series anomaly detection methods fall short in tackling the unique characteristics of astronomical observations where each star is inherently independent but interfered by random concurrent noise, resulting in a high rate of false alarms. To overcome the challenges, we propose AERO, a novel two-stage framework tailored for unsupervised anomaly detection in astronomical observations. In the first stage, we employ a Transformer-based encoder-decoder architecture to learn the normal temporal patterns on each variate (i.e., star) in alignment with the characteristic of variate independence. In the second stage, we enhance the graph neural network with a window-wise graph structure learning to tackle the occurrence of concurrent noise characterized by spatial and temporal randomness. In this way, AERO is not only capable of distinguishing normal temporal patterns from potential anomalies but also effectively differentiating concurrent noise, thus decreasing the number of false alarms. We conducted extensive experiments on three synthetic datasets and three real-world datasets. The results demonstrate that AERO outperforms the compared baselines. Notably, compared to the state-of-the-art model, AERO improves the F1-score by up to 8.76% and 2.63% on synthetic and real-world datasets respectively.
- [1492] arXiv:2403.10228 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HawkEye: Training Video-Text LLMs for Grounding Text in VideosSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images. In this paper, we propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye is better at temporal video grounding and comparable on other video-text tasks with existing video-text LLMs, which verifies its superior video-text multi-modal understanding abilities.
- [1493] arXiv:2403.10231 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Less is More: One-shot Subgraph Reasoning on Large-scale Knowledge GraphsComments: 32 pages, 43 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: To deduce new facts on a knowledge graph (KG), a link predictor learns from the graph structure and collects local evidence to find the answer to a given query. However, existing methods suffer from a severe scalability problem due to the utilization of the whole KG for prediction, which hinders their promise on large scale KGs and cannot be directly addressed by vanilla sampling methods. In this work, we propose the one-shot-subgraph link prediction to achieve efficient and adaptive prediction. The design principle is that, instead of directly acting on the whole KG, the prediction procedure is decoupled into two steps, i.e., (i) extracting only one subgraph according to the query and (ii) predicting on this single, query dependent subgraph. We reveal that the non-parametric and computation-efficient heuristics Personalized PageRank (PPR) can effectively identify the potential answers and supporting evidence. With efficient subgraph-based prediction, we further introduce the automated searching of the optimal configurations in both data and model spaces. Empirically, we achieve promoted efficiency and leading performances on five large-scale benchmarks. The code is publicly available at: this https URL .
- [1494] arXiv:2403.10259 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Comprehensive Study Of Predictive Maintenance In Industries Using Classification Models And LSTM ModelSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In today's technology-driven era, the imperative for predictive maintenance and advanced diagnostics extends beyond aviation to encompass the identification of damages, failures, and operational defects in rotating and moving machines. Implementing such services not only curtails maintenance costs but also extends machine lifespan, ensuring heightened operational efficiency. Moreover, it serves as a preventive measure against potential accidents or catastrophic events. The advent of Artificial Intelligence (AI) has revolutionized maintenance across industries, enabling more accurate and efficient prediction and analysis of machine failures, thereby conserving time and resources. Our proposed study aims to delve into various machine learning classification techniques, including Support Vector Machine (SVM), Random Forest, Logistic Regression, and Convolutional Neural Network LSTM-Based, for predicting and analyzing machine performance. SVM classifies data into different categories based on their positions in a multidimensional space, while Random Forest employs ensemble learning to create multiple decision trees for classification. Logistic Regression predicts the probability of binary outcomes using input data. The primary objective of the study is to assess these algorithms' performance in predicting and analyzing machine performance, considering factors such as accuracy, precision, recall, and F1 score. The findings will aid maintenance experts in selecting the most suitable machine learning algorithm for effective prediction and analysis of machine performance.
- [1495] arXiv:2403.10275 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: A Question on the Explainability of Large Language Models and the Word-Level Univariate First-Order Plausibility AssumptionComments: 7 pages, 10 figures, Accepted and presented at AAAI 2024 (ReLM workshop)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The explanations of large language models have recently been shown to be sensitive to the randomness used for their training, creating a need to characterize this sensitivity. In this paper, we propose a characterization that questions the possibility to provide simple and informative explanations for such models. To this end, we give statistical definitions for the explanations' signal, noise and signal-to-noise ratio. We highlight that, in a typical case study where word-level univariate explanations are analyzed with first-order statistical tools, the explanations of simple feature-based models carry more signal and less noise than those of transformer ones. We then discuss the possibility to improve these results with alternative definitions of signal and noise that would capture more complex explanations and analysis methods, while also questioning the tradeoff with their plausibility for readers.
- [1496] arXiv:2403.10281 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Team Trifecta at Factify5WQA: Setting the Standard in Fact Verification with Fine-TuningComments: Accepted by AAAI 2024 Workshop: FACTIFY 3.0 - Workshop Series on Multimodal Fact-Checking and Hate Speech DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we present Pre-CoFactv3, a comprehensive framework comprised of Question Answering and Text Classification components for fact verification. Leveraging In-Context Learning, Fine-tuned Large Language Models (LLMs), and the FakeNet model, we address the challenges of fact verification. Our experiments explore diverse approaches, comparing different Pre-trained LLMs, introducing FakeNet, and implementing various ensemble methods. Notably, our team, Trifecta, secured first place in the AAAI-24 Factify 3.0 Workshop, surpassing the baseline accuracy by 103% and maintaining a 70% lead over the second competitor. This success underscores the efficacy of our approach and its potential contributions to advancing fact verification research.
- [1497] arXiv:2403.10288 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Rough Transformers for Continuous and Efficient Time-Series ModellingSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to model irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate lengths and greater. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model which operates on continuous-time representations of input sequences and incurs significantly reduced computational costs, critical for addressing long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in input data, while remaining robust to changes in the sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.
- [1498] arXiv:2403.10326 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language ModelComments: Findings of short paper, EMNLP 2022Journal-ref: chiang-etal-2022-cdgpSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Manually designing cloze test consumes enormous time and efforts. The major challenge lies in wrong option (distractor) selection. Having carefully-design distractors improves the effectiveness of learner ability assessment. As a result, the idea of automatically generating cloze distractor is motivated. In this paper, we investigate cloze distractor generation by exploring the employment of pre-trained language models (PLMs) as an alternative for candidate distractor generation. Experiments show that the PLM-enhanced model brings a substantial performance improvement. Our best performing model advances the state-of-the-art result from 14.94 to 34.17 (NDCG@10 score). Our code and dataset is available at this https URL .
- [1499] arXiv:2403.10327 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Unsupervised Threat Hunting using Continuous Bag-of-Terms-and-Time (CBoTT)Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Threat hunting is sifting through system logs to detect malicious activities that might have bypassed existing security measures. It can be performed in several ways, one of which is based on detecting anomalies. We propose an unsupervised framework, called continuous bag-of-terms-and-time (CBoTT), and publish its application programming interface (API) to help researchers and cybersecurity analysts perform anomaly-based threat hunting among SIEM logs geared toward process auditing on endpoint devices. Analyses show that our framework consistently outperforms benchmark approaches. When logs are sorted by likelihood of being an anomaly (from most likely to least), our approach identifies anomalies at higher percentiles (between 1.82-6.46) while benchmark approaches identify the same anomalies at lower percentiles (between 3.25-80.92). This framework can be used by other researchers to conduct benchmark analyses and cybersecurity analysts to find anomalies in SIEM logs.
- [1500] arXiv:2403.10365 (cross-list from cs.DS) [ pdf , ps , html , other ]
-
Title: Scalable Algorithms for Individual Preference Stable ClusteringComments: 59 pages, 9 figures, submitted to AIStats2024Subjects: Data Structures and Algorithms (cs.DS) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: In this paper, we study the individual preference (IP) stability, which is an notion capturing individual fairness and stability in clustering. Within this setting, a clustering is $\alpha$-IP stable when each data point's average distance to its cluster is no more than $\alpha$ times its average distance to any other cluster. In this paper, we study the natural local search algorithm for IP stable clustering. Our analysis confirms a $O(\log n)$-IP stability guarantee for this algorithm, where $n$ denotes the number of points in the input. Furthermore, by refining the local search approach, we show it runs in an almost linear time, $\tilde{O}(nk)$.
- [1501] arXiv:2403.10371 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: An Energy-Efficient Ensemble Approach for Mitigating Data Incompleteness in IoT ApplicationsComments: 8 pages, 8 figures, 1 table, Accepted as a conference paper at IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING IN SMART SYSTEMS AND THE INTERNET OF THINGS (DCOSS-IoT 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract: Machine Learning (ML) is becoming increasingly important for IoT-based applications. However, the dynamic and ad-hoc nature of many IoT ecosystems poses unique challenges to the efficacy of ML algorithms. One such challenge is data incompleteness, which is manifested as missing sensor readings. Many factors, including sensor failures and/or network disruption, can cause data incompleteness. Furthermore, most IoT systems are severely power-constrained. It is important that we build IoT-based ML systems that are robust against data incompleteness while simultaneously being energy efficient. This paper presents an empirical study of SECOE - a recent technique for alleviating data incompleteness in IoT - with respect to its energy bottlenecks. Towards addressing the energy bottlenecks of SECOE, we propose ENAMLE - a proactive, energy-aware technique for mitigating the impact of concurrent missing data. ENAMLE is unique in the sense that it builds an energy-aware ensemble of sub-models, each trained with a subset of sensors chosen carefully based on their correlations. Furthermore, at inference time, ENAMLE adaptively alters the number of the ensemble of models based on the amount of missing data rate and the energy-accuracy trade-off. ENAMLE's design includes several novel mechanisms for minimizing energy consumption while maintaining accuracy. We present extensive experimental studies on two distinct datasets that demonstrate the energy efficiency of ENAMLE and its ability to alleviate sensor failures.
- [1502] arXiv:2403.10380 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: BirdSet: A Multi-Task Benchmark for Classification in Computational Avian BioacousticsLukas Rauch , Raphael Schwinger , Moritz Wirth , René Heinrich , Jonas Lange , Stefan Kahl , Bernhard Sick , Sven Tomforde , Christoph ScholzComments: Work in progress, to be submitted @DMLR next monthSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Deep learning (DL) models have emerged as a powerful tool in avian bioacoustics to diagnose environmental health and biodiversity. However, inconsistencies in research pose notable challenges hindering progress. Reliable DL models need to analyze bird calls flexibly across various species and environments to fully harness the potential of bioacoustics in a cost-effective passive acoustic monitoring scenario. Data fragmentation and opacity across studies complicate a comprehensive evaluation of model performance. To overcome these challenges, we present the BirdSet benchmark, a unified framework consolidating research efforts with a holistic approach for the classification of bird vocalizations in computational avian bioacoustics. BirdSet aggregates open-source bird recordings into a curated dataset collection. This unified approach provides an in-depth understanding of model performance and identifies potential shortcomings across different tasks. By providing baseline results of current models, we aim to facilitate comparability and ease accessibility for newcomers. Additionally, we release an open-source package \benchmark containing a comprehensive data pipeline that enables easy and fast model evaluation, available at this https URL .
- [1503] arXiv:2403.10401 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: SculptDiff: Learning Robotic Clay Sculpting from Humans with Goal Conditioned Diffusion PolicySubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Manipulating deformable objects remains a challenge within robotics due to the difficulties of state estimation, long-horizon planning, and predicting how the object will deform given an interaction. These challenges are the most pronounced with 3D deformable objects. We propose SculptDiff, a goal-conditioned diffusion-based imitation learning framework that works with point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of our knowledge this is the first real-world method that successfully learns manipulation policies for 3D deformable objects. For sculpting videos and access to our dataset and hardware CAD models, see the project website: this https URL
- [1504] arXiv:2403.10403 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Energy Correction Model in the Feature Space for Out-of-Distribution DetectionComments: NeurIPS ML Safety Workshop (2022)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this work, we study the out-of-distribution (OOD) detection problem through the use of the feature space of a pre-trained deep classifier. We show that learning the density of in-distribution (ID) features with an energy-based models (EBM) leads to competitive detection results. However, we found that the non-mixing of MCMC sampling during the EBM's training undermines its detection performance. To overcome this an energy-based correction of a mixture of class-conditional Gaussian distributions. We obtains favorable results when compared to a strong baseline like the KNN detector on the CIFAR-10/CIFAR-100 OOD detection benchmarks.
- [1505] arXiv:2403.10425 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: NeuFlow: Real-time, High-accuracy Optical Flow Estimation on Robots Using Edge DevicesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Real-time high-accuracy optical flow estimation is a crucial component in various applications, including localization and mapping in robotics, object tracking, and activity recognition in computer vision. While recent learning-based optical flow methods have achieved high accuracy, they often come with heavy computation costs. In this paper, we propose a highly efficient optical flow architecture, called NeuFlow, that addresses both high accuracy and computational cost concerns. The architecture follows a global-to-local scheme. Given the features of the input images extracted at different spatial resolutions, global matching is employed to estimate an initial optical flow on the 1/16 resolution, capturing large displacement, which is then refined on the 1/8 resolution with lightweight CNN layers for better accuracy. We evaluate our approach on Jetson Orin Nano and RTX 2080 to demonstrate efficiency improvements across different computing platforms. We achieve a notable 10x-80x speedup compared to several state-of-the-art methods, while maintaining comparable accuracy. Our approach achieves around 30 FPS on edge computing platforms, which represents a significant breakthrough in deploying complex computer vision tasks such as SLAM on small robots like drones. The full training and evaluation code is available at this https URL .
- [1506] arXiv:2403.10433 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: AI-enhanced Collective Intelligence: The State of the Art and ProspectsComments: 27 pages, 2 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The current societal challenges exceed the capacity of human individual or collective effort alone. As AI evolves, its role within human collectives is poised to vary from an assistive tool to a participatory member. Humans and AI possess complementary capabilities that, when synergized, can achieve a level of collective intelligence that surpasses the collective capabilities of either humans or AI in isolation. However, the interactions in human-AI systems are inherently complex, involving intricate processes and interdependencies. This review incorporates perspectives from network science to conceptualize a multilayer representation of human-AI collective intelligence, comprising a cognition layer, a physical layer, and an information layer. Within this multilayer network, humans and AI agents exhibit varying characteristics; humans differ in diversity from surface-level to deep-level attributes, while AI agents range in degrees of functionality and anthropomorphism. The interplay among these agents shapes the overall structure and dynamics of the system. We explore how agents' diversity and interactions influence the system's collective intelligence. Furthermore, we present an analysis of real-world instances of AI-enhanced collective intelligence. We conclude by addressing the potential challenges in AI-enhanced collective intelligence and offer perspectives on future developments in this field.
- [1507] arXiv:2403.10438 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Data Ethics Emergency Drill: A Toolbox for Discussing Responsible AI for Industry TeamsComments: accepted to CHI 2024Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Researchers urge technology practitioners such as data scientists to consider the impacts and ethical implications of algorithmic decisions. However, unlike programming, statistics, and data management, discussion of ethical implications is rarely included in standard data science training. To begin to address this gap, we designed and tested a toolbox called the data ethics emergency drill (DEED) to help data science teams discuss and reflect on the ethical implications of their work. The DEED is a roleplay of a fictional ethical emergency scenario that is contextually situated in the team's specific workplace and applications. This paper outlines the DEED toolbox and describes three studies carried out with two different data science teams that iteratively shaped its design. Our findings show that practitioners can apply lessons learnt from the roleplay to real-life situations, and how the DEED opened up conversations around ethics and values.
- [1508] arXiv:2403.10454 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Partially Observable Task and Motion Planning with Uncertainty and Risk AwarenessAidan Curtis , George Matheos , Nishad Gothoskar , Vikash Mansinghka , Joshua Tenenbaum , Tomás Lozano-Pérez , Leslie Pack KaelblingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Integrated task and motion planning (TAMP) has proven to be a valuable approach to generalizable long-horizon robotic manipulation and navigation problems. However, the typical TAMP problem formulation assumes full observability and deterministic action effects. These assumptions limit the ability of the planner to gather information and make decisions that are risk-aware. We propose a strategy for TAMP with Uncertainty and Risk Awareness (TAMPURA) that is capable of efficiently solving long-horizon planning problems with initial-state and action outcome uncertainty, including problems that require information gathering and avoiding undesirable and irreversible outcomes. Our planner reasons under uncertainty at both the abstract task level and continuous controller level. Given a set of closed-loop goal-conditioned controllers operating in the primitive action space and a description of their preconditions and potential capabilities, we learn a high-level abstraction that can be solved efficiently and then refined to continuous actions for execution. We demonstrate our approach on several robotics problems where uncertainty is a crucial factor and show that reasoning under uncertainty in these problems outperforms previously proposed determinized planning, direct search, and reinforcement learning strategies. Lastly, we demonstrate our planner on two real-world robotics problems using recent advancements in probabilistic perception.
- [1509] arXiv:2403.10460 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Online Concurrent Multi-Robot Coverage Path PlanningSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Recently, centralized receding horizon online multi-robot coverage path planning algorithms have shown remarkable scalability in thoroughly exploring large, complex, unknown workspaces with many robots. In a horizon, the path planning and the path execution interleave, meaning when the path planning occurs for robots with no paths, the robots with outstanding paths do not execute, and subsequently, when the robots with new or outstanding paths execute to reach respective goals, path planning does not occur for those robots yet to get new paths, leading to wastage of both the robotic and the computation resources. As a remedy, we propose a centralized algorithm that is not horizon-based. It plans paths at any time for a subset of robots with no paths, i.e., who have reached their previously assigned goals, while the rest execute their outstanding paths, thereby enabling concurrent planning and execution. We formally prove that the proposed algorithm ensures complete coverage of an unknown workspace and analyze its time complexity. To demonstrate scalability, we evaluate our algorithm to cover eight large $2$D grid benchmark workspaces with up to 512 aerial and ground robots, respectively. A comparison with a state-of-the-art horizon-based algorithm shows its superiority in completing the coverage with up to 1.6x speedup. For validation, we perform ROS + Gazebo simulations in six 2D grid benchmark workspaces with 10 quadcopters and TurtleBots, respectively. We also successfully conducted one outdoor experiment with three quadcopters and one indoor with two TurtleBots.
- [1510] arXiv:2403.10462 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Safety Cases: How to Justify the Safety of Advanced AI SystemsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
- [1511] arXiv:2403.10482 (cross-list from q-fin.CP) [ pdf , ps , other ]
-
Title: Can a GPT4-Powered AI Agent Be a Good Enough Performance Attribution Analyst?Subjects: Computational Finance (q-fin.CP) ; Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
Abstract: Performance attribution analysis, defined as the process of explaining the drivers of the excess performance of an investment portfolio against a benchmark, stands as a significant feature of portfolio management and plays a crucial role in the investment decision-making process, particularly within the fund management industry. Rooted in a solid financial and mathematical framework, the importance and methodologies of this analytical technique are extensively documented across numerous academic research papers and books. The integration of large language models (LLMs) and AI agents marks a groundbreaking development in this field. These agents are designed to automate and enhance the performance attribution analysis by accurately calculating and analyzing portfolio performances against benchmarks. In this study, we introduce the application of an AI Agent for a variety of essential performance attribution tasks, including the analysis of performance drivers and utilizing LLMs as calculation engine for multi-level attribution analysis and question-answering (QA) tasks. Leveraging advanced prompt engineering techniques such as Chain-of-Thought (CoT) and Plan and Solve (PS), and employing a standard agent framework from LangChain, the research achieves promising results: it achieves accuracy rates exceeding 93% in analyzing performance drivers, attains 100% in multi-level attribution calculations, and surpasses 84% accuracy in QA exercises that simulate official examination standards. These findings affirm the impactful role of AI agents, prompt engineering and evaluation in advancing portfolio management processes, highlighting a significant development in the practical application and evaluation of Generative AI technologies within the domain.
- [1512] arXiv:2403.10487 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Stimulate the Potential of Robots via CompetitionSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: It is common for us to feel pressure in a competition environment, which arises from the desire to obtain success comparing with other individuals or opponents. Although we might get anxious under the pressure, it could also be a drive for us to stimulate our potentials to the best in order to keep up with others. Inspired by this, we propose a competitive learning framework which is able to help individual robot to acquire knowledge from the competition, fully stimulating its dynamics potential in the race. Specifically, the competition information among competitors is introduced as the additional auxiliary signal to learn advantaged actions. We further build a Multiagent-Race environment, and extensive experiments are conducted, demonstrating that robots trained in competitive environments outperform ones that are trained with SoTA algorithms in single robot environment.
- [1513] arXiv:2403.10499 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot StudySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Pre-training image representations from the raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that often reach competitiveness with fully supervised methods without the need for task-specific training. Besides the encouraging performance on classification accuracy, it is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper, we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural, 3 synthetic distribution shifts, and 11 adversarial attacks. We use CLIP as a pilot study. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our evaluation shows a comprehensive evaluation of robustness is necessary; and there is a significant need to improve the robustness of zero-shot multimodal models.
- [1514] arXiv:2403.10506 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and ManipulationSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Humanoid robots hold great promise in assisting humans in diverse environments and tasks, due to their flexibility and adaptability leveraging human-like morphology. However, research in humanoid robots is often bottlenecked by the costly and fragile hardware setups. To accelerate algorithmic research in humanoid robots, we present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands and a variety of challenging whole-body manipulation and locomotion tasks. Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning baseline achieves superior performance when supported by robust low-level policies, such as walking or reaching. With HumanoidBench, we provide the robotics community with a platform to identify the challenges arising when solving diverse tasks with humanoid robots, facilitating prompt verification of algorithms and ideas. The open-source code is available at this https URL .
- [1515] arXiv:2403.10516 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FeatUp: A Model-Agnostic Framework for Features at Any ResolutionComments: Accepted to the International Conference on Learning Representations (ICLR) 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
- [1516] arXiv:2403.10517 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VideoAgent: Long-form Video Understanding with Large Language Model as AgentSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Abstract: Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
- [1517] arXiv:2403.10534 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VISREAS: Complex Visual Reasoning with Unanswerable QuestionsComments: 18 pages, 14 figures, 5 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should address the discrepancies in the query and convey them to the users rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, that consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION that reasons by producing and executing pseudocode without any external modules to generate the answer. LOGIC2VISION outperforms generative models in VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in performance against the classification models.
- [1518] arXiv:2403.10538 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: MATADOR: Automated System-on-Chip Tsetlin Machine Design Generation for Edge ApplicationsSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: System-on-Chip Field-Programmable Gate Arrays (SoC-FPGAs) offer significant throughput gains for machine learning (ML) edge inference applications via the design of co-processor accelerator systems. However, the design effort for training and translating ML models into SoC-FPGA solutions can be substantial and requires specialist knowledge aware trade-offs between model performance, power consumption, latency and resource utilization. Contrary to other ML algorithms, Tsetlin Machine (TM) performs classification by forming logic proposition between boolean actions from the Tsetlin Automata (the learning elements) and boolean input features. A trained TM model, usually, exhibits high sparsity and considerable overlapping of these logic propositions both within and among the classes. The model, thus, can be translated to RTL-level design using a miniscule number of AND and NOT gates. This paper presents MATADOR, an automated boolean-to-silicon tool with GUI interface capable of implementing optimized accelerator design of the TM model onto SoC-FPGA for inference at the edge. It offers automation of the full development pipeline: model training, system level design generation, design verification and deployment. It makes use of the logic sharing that ensues from propositional overlap and creates a compact design by effectively utilizing the TM model's sparsity. MATADOR accelerator designs are shown to be up to 13.4x faster, up to 7x more resource frugal and up to 2x more power efficient when compared to the state-of-the-art Quantized and Binary Deep Neural Network implementations.
- [1519] arXiv:2403.10544 (cross-list from stat.AP) [ pdf , ps , html , other ]
-
Title: Process-Aware Analysis of Treatment Paths in Heart Failure Patients: A Case StudyHarry H. Beyel , Marlo Verket , Viki Peeva , Christian Rennert , Marco Pegoraro , Katharina Schütt , Wil M.P. van der Aalst , Nikolaus MarxComments: 10 pages, 3 figures, 9 tables, 31 referencesSubjects: Applications (stat.AP) ; Artificial Intelligence (cs.AI)
Abstract: Process mining in healthcare presents a range of challenges when working with different types of data within the healthcare domain. There is high diversity considering the variety of data collected from healthcare processes: operational processes given by claims data, a collection of events during surgery, data related to pre-operative and post-operative care, and high-level data collections based on regular ambulant visits with no apparent events. In this case study, a data set from the last category is analyzed. We apply process-mining techniques on sparse patient heart failure data and investigate whether an information gain towards several research questions is achievable. Here, available data are transformed into an event log format, and process discovery and conformance checking are applied. Additionally, patients are split into different cohorts based on comorbidities, such as diabetes and chronic kidney disease, and multiple statistics are compared between the cohorts. Conclusively, we apply decision mining to determine whether a patient will have a cardiovascular outcome and whether a patient will die.
- [1520] arXiv:2403.10547 (cross-list from math.OC) [ pdf , ps , html , other ]
-
Title: Robust Second-Order Nonconvex Optimization and Its Application to Low Rank Matrix SensingSubjects: Optimization and Control (math.OC) ; Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Abstract: Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings.
In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with \emph{dimension-independent} accuracy guarantees, using $\widetilde{O}({D^2}/{\epsilon})$ samples where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints.
As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms. - [1521] arXiv:2403.10550 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing FlowsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: With the rapid development of the Internet, various types of anomaly traffic are threatening network security. We consider the problem of anomaly network traffic detection and propose a three-stage anomaly detection framework using only normal traffic. Our framework can generate pseudo anomaly samples without prior knowledge of anomalies to achieve the detection of anomaly data. Firstly, we employ a reconstruction method to learn the deep representation of normal samples. Secondly, these representations are normalized to a standard normal distribution using a bidirectional flow module. To simulate anomaly samples, we add noises to the normalized representations which are then passed through the generation direction of the bidirectional flow module. Finally, a simple classifier is trained to differentiate the normal samples and pseudo anomaly samples in the latent space. During inference, our framework requires only two modules to detect anomalous samples, leading to a considerable reduction in model size. According to the experiments, our method achieves the state of-the-art results on the common benchmarking datasets of anomaly network traffic detection. The code is given in the this https URL
- [1522] arXiv:2403.10552 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Training Self-localization Models for Unseen Unfamiliar Places via Teacher-to-Student Data-Free Knowledge TransferComments: 7 pages, 3 figures, technical reportSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract: A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.
- [1523] arXiv:2403.10553 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning to Watermark LLM-generated Text via Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: We study how to watermark LLM outputs, i.e. embedding algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermark that embeds signals into the output, we design a model-level watermark that embeds signals into the LLM weights, and such signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). It also allows watermarked model open-sourcing. In addition, if used together with alignment, the extra overhead introduced is low - only training an extra reward model (i.e. our detector). We hope our work can bring more effort into studying a broader watermark design that is not limited to working with a fixed LLM. We open-source the code: this https URL .
- [1524] arXiv:2403.10555 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: KARINA: An Efficient Deep Learning Model for Global Weather ForecastSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract: Deep learning-based, data-driven models are gaining prevalence in climate research, particularly for global weather prediction. However, training the global weather data at high resolution requires massive computational resources. Therefore, we present a new model named KARINA to overcome the substantial computational demands typical of this field. This model achieves forecasting accuracy comparable to higher-resolution counterparts with significantly less computational resources, requiring only 4 NVIDIA A100 GPUs and less than 12 hours of training. KARINA combines ConvNext, SENet, and Geocyclic Padding to enhance weather forecasting at a 2.5° resolution, which could filter out high-frequency noise. Geocyclic Padding preserves pixels at the lateral boundary of the input image, thereby maintaining atmospheric flow continuity in the spherical Earth. SENet dynamically improves feature response, advancing atmospheric process modeling, particularly in the vertical column process as numerous channels. In this vein, KARINA sets new benchmarks in weather forecasting accuracy, surpassing existing models like the ECMWF S2S reforecasts at a lead time of up to 7 days. Remarkably, KARINA achieved competitive performance even when compared to the recently developed models (Pangu-Weather, GraphCast, ClimaX, and FourCastNet) trained with high-resolution data having 100 times larger pixels. Conclusively, KARINA significantly advances global weather forecasting by efficiently modeling Earth's atmosphere with improved accuracy and resource efficiency.
- [1525] arXiv:2403.10557 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Second-Order Information Matters: Revisiting Machine Unlearning for Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: With the rapid development of Large Language Models (LLMs), we have witnessed intense competition among the major LLM products like ChatGPT, LLaMa, and Gemini. However, various issues (e.g. privacy leakage and copyright violation) of the training corpus still remain underexplored. For example, the Times sued OpenAI and Microsoft for infringing on its copyrights by using millions of its articles for training. From the perspective of LLM practitioners, handling such unintended privacy violations can be challenging. Previous work addressed the ``unlearning" problem of LLMs using gradient information, while they mostly introduced significant overheads like data preprocessing or lacked robustness. In this paper, contrasting with the methods based on first-order information, we revisit the unlearning problem via the perspective of second-order information (Hessian). Our unlearning algorithms, which are inspired by classic Newton update, are not only data-agnostic/model-agnostic but also proven to be robust in terms of utility preservation or privacy guarantee. Through a comprehensive evaluation with four NLP datasets as well as a case study on real-world datasets, our methods consistently show superiority over the first-order methods.
- [1526] arXiv:2403.10559 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AIComments: 9 pages, 2 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: This report investigates the history and impact of Generative Models and Connected and Automated Vehicles (CAVs), two groundbreaking forces pushing progress in technology and transportation. By focusing on the application of generative models within the context of CAVs, the study aims to unravel how this integration could enhance predictive modeling, simulation accuracy, and decision-making processes in autonomous vehicles. This thesis discusses the benefits and challenges of integrating generative models and CAV technology in transportation. It aims to highlight the progress made, the remaining obstacles, and the potential for advancements in safety and innovation.
- [1527] arXiv:2403.10561 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A collection of the accepted papers for the Human-Centric Representation Learning workshop at AAAI 2024Dimitris Spathis , Aaqib Saeed , Ali Etemad , Sana Tonekaboni , Stefanos Laskaridis , Shohreh Deldari , Chi Ian Tang , Patrick Schwab , Shyam TailorSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This non-archival index is not complete, as some accepted papers chose to opt-out of inclusion. The list of all accepted papers is available on the workshop website.
- [1528] arXiv:2403.10562 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial AttacksSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Our paper presents a novel defence against black box attacks, where attackers use the victim model as an oracle to craft their adversarial examples. Unlike traditional preprocessing defences that rely on sanitizing input samples, our stateless strategy counters the attack process itself. For every query we evaluate a counter-sample instead, where the counter-sample is the original sample optimized against the attacker's objective. By countering every black box query with a targeted white box optimization, our strategy effectively introduces an asymmetry to the game to the defender's advantage. This defence not only effectively misleads the attacker's search for an adversarial example, it also preserves the model's accuracy on legitimate inputs and is generic to multiple types of attacks.
We demonstrate that our approach is remarkably effective against state-of-the-art black box attacks and outperforms existing defences for both the CIFAR-10 and ImageNet datasets. Additionally, we also show that the proposed defence is robust against strong adversaries as well. - [1529] arXiv:2403.10565 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: PTSD-MDNN : Fusion tardive de r\'eseaux de neurones profonds multimodaux pour la d\'etection du trouble de stress post-traumatiqueComments: in French language. GRETSI 2023Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Abstract: In order to provide a more objective and quicker way to diagnose post-traumatic stress disorder (PTSD), we present PTSD-MDNN which merges two unimodal convolutional neural networks and which gives low detection error rate. By taking only videos and audios as inputs, the model could be used in the configuration of teleconsultation sessions, in the optimization of patient journeys or for human-robot interaction.
- [1530] arXiv:2403.10566 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Cooling-Guide Diffusion Model for Battery Cell ArrangementSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Our study introduces a Generative AI method that employs a cooling-guided diffusion model to optimize the layout of battery cells, a crucial step for enhancing the cooling performance and efficiency of battery thermal management systems. Traditional design processes, which rely heavily on iterative optimization and extensive guesswork, are notoriously slow and inefficient, often leading to suboptimal solutions. In contrast, our innovative method uses a parametric denoising diffusion probabilistic model (DDPM) with classifier and cooling guidance to generate optimized cell layouts with enhanced cooling paths, significantly lowering the maximum temperature of the cells. By incorporating position-based classifier guidance, we ensure the feasibility of generated layouts. Meanwhile, cooling guidance directly optimizes cooling-efficiency, making our approach uniquely effective. When compared to two advanced models, the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) and the Conditional Tabular GAN (CTGAN), our cooling-guided diffusion model notably outperforms both. It is five times more effective than TabDDPM and sixty-six times better than CTGAN across key metrics such as feasibility, diversity, and cooling efficiency. This research marks a significant leap forward in the field, aiming to optimize battery cell layouts for superior cooling efficiency, thus setting the stage for the development of more effective and dependable battery thermal management systems.
- [1531] arXiv:2403.10568 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt ExpertsComments: Extended version of arXiv:2312.03734Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Prompt-tuning has demonstrated parameter-efficiency in fusing unimodal foundation models for multimodal tasks. However, its limited adaptivity and expressiveness lead to suboptimal performance when compared with other tuning methods. In this paper, we address this issue by disentangling the vanilla prompts to adaptively capture dataset-level and instance-level features. Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness. MoPE leverages multimodal pairing priors to route the most effective prompt on a per-instance basis. Compared to vanilla prompting, our MoPE-based conditional prompting exhibits greater expressiveness for multimodal fusion, scaling better with the training data and the overall number of trainable parameters. We also study a regularization term for expert routing, leading to emergent expert specialization, where different experts focus on different concepts, enabling interpretable soft prompting. Extensive experiments across three multimodal datasets demonstrate that our method achieves state-of-the-art results, matching or even surpassing the performance of fine-tuning, while requiring only 0.8% of the trainable parameters. Code will be released: this https URL .
- [1532] arXiv:2403.10569 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Achieving Pareto Optimality using Efficient Parameter Reduction for DNNs in Resource-Constrained Edge EnvironmentAtah Nuh Mih , Alireza Rahimi , Asfia Kawnine , Francis Palma , Monica Wachowicz , Rickey Dubay , Hung CaoComments: arXiv admin note: text overlap with arXiv:2401.05355Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: This paper proposes an optimization of an existing Deep Neural Network (DNN) that improves its hardware utilization and facilitates on-device training for resource-constrained edge environments. We implement efficient parameter reduction strategies on Xception that shrink the model size without sacrificing accuracy, thus decreasing memory utilization during training. We evaluate our model in two experiments: Caltech-101 image classification and PCB defect detection and compare its performance against the original Xception and lightweight models, EfficientNetV2B1 and MobileNetV2. The results of the Caltech-101 image classification show that our model has a better test accuracy (76.21%) than Xception (75.89%), uses less memory on average (847.9MB) than Xception (874.6MB), and has faster training and inference times. The lightweight models overfit with EfficientNetV2B1 having a 30.52% test accuracy and MobileNetV2 having a 58.11% test accuracy. Both lightweight models have better memory usage than our model and Xception. On the PCB defect detection, our model has the best test accuracy (90.30%), compared to Xception (88.10%), EfficientNetV2B1 (55.25%), and MobileNetV2 (50.50%). MobileNetV2 has the least average memory usage (849.4MB), followed by our model (865.8MB), then EfficientNetV2B1 (874.8MB), and Xception has the highest (893.6MB). We further experiment with pre-trained weights and observe that memory usage decreases thereby showing the benefits of transfer learning. A Pareto analysis of the models' performance shows that our optimized model architecture satisfies accuracy and low memory utilization objectives.
- [1533] arXiv:2403.10570 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Symbiotic Game and Foundation Models for Cyber Deception Operations in Strategic Cyber WarfareSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Abstract: We are currently facing unprecedented cyber warfare with the rapid evolution of tactics, increasing asymmetry of intelligence, and the growing accessibility of hacking tools. In this landscape, cyber deception emerges as a critical component of our defense strategy against increasingly sophisticated attacks. This chapter aims to highlight the pivotal role of game-theoretic models and foundation models (FMs) in analyzing, designing, and implementing cyber deception tactics. Game models (GMs) serve as a foundational framework for modeling diverse adversarial interactions, allowing us to encapsulate both adversarial knowledge and domain-specific insights. Meanwhile, FMs serve as the building blocks for creating tailored machine learning models suited to given applications. By leveraging the synergy between GMs and FMs, we can advance proactive and automated cyber defense mechanisms by not only securing our networks against attacks but also enhancing their resilience against well-planned operations. This chapter discusses the games at the tactical, operational, and strategic levels of warfare, delves into the symbiotic relationship between these methodologies, and explores relevant applications where such a framework can make a substantial impact in cybersecurity. The chapter discusses the promising direction of the multi-agent neurosymbolic conjectural learning (MANSCOL), which allows the defender to predict adversarial behaviors, design adaptive defensive deception tactics, and synthesize knowledge for the operational level synthesis and adaptation. FMs serve as pivotal tools across various functions for MANSCOL, including reinforcement learning, knowledge assimilation, formation of conjectures, and contextual representation. This chapter concludes with a discussion of the challenges associated with FMs and their application in the domain of cybersecurity.
- [1534] arXiv:2403.10575 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Exploring Language Model's Code Generation Ability with Auxiliary FunctionsComments: NAACL2024 FindingsSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Auxiliary function is a helpful component to improve language model's code generation ability. However, a systematic exploration of how they affect has yet to be done. In this work, we comprehensively evaluate the ability to utilize auxiliary functions encoded in recent code-pretrained language models. First, we construct a human-crafted evaluation set, called HumanExtension, which contains examples of two functions where one function assists the other. With HumanExtension, we design several experiments to examine their ability in a multifaceted way. Our evaluation processes enable a comprehensive understanding of including auxiliary functions in the prompt in terms of effectiveness and robustness. An additional implementation style analysis captures the models' various implementation patterns when they access the auxiliary function. Through this analysis, we discover the models' promising ability to utilize auxiliary functions including their self-improving behavior by implementing the two functions step-by-step. However, our analysis also reveals the model's underutilized behavior to call the auxiliary function, suggesting the future direction to enhance their implementation by eliciting the auxiliary function call ability encoded in the models. We release our code and dataset to facilitate this research direction.
- [1535] arXiv:2403.10581 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: Large Language Model-informed ECG Dual Attention Network for Heart Failure Risk PredictionComments: Under journal revisionSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract: Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could significantly reduce its impact. We introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs). We present a novel, lightweight dual-attention ECG network designed to capture complex ECG features essential for early HF risk prediction, despite the notable imbalance between low and high-risk groups. This network incorporates a cross-lead attention module and twelve lead-specific temporal attention modules, focusing on cross-lead interactions and each lead's local dynamics. To further alleviate model overfitting, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study, focusing on patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI).The results reveal that LLM-informed pre-training substantially enhances HF risk prediction in these cohorts. The dual-attention design not only improves interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 for UKB-HYP and 0.5805 for UKB-MI. This demonstrates our method's potential in advancing HF risk assessment with clinical complex ECG data.
- [1536] arXiv:2403.10585 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient ViewpointComments: Accepted and to Appear, AISTATS 2024Subjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task specific model fine-tuning. To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method by viewing the intermediate noisy images as policies and the target image as the states selected by the policy. Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting into a higher image restoration quality on FFHQ, ImageNet and LSUN datasets.
- [1537] arXiv:2403.10586 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: From Algorithms to Outcomes: Reviewing AI's Role in Non-Muscle-Invasive Bladder Cancer Recurrence PredictionComments: 16 pages, 4 FiguresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Bladder cancer, the leading urinary tract cancer, is responsible for 15 deaths daily in the UK. This cancer predominantly manifests as non-muscle-invasive bladder cancer (NMIBC), characterised by tumours not yet penetrating the muscle layer of the bladder wall. NMIBC is plagued by a very high recurrence rate of 70-80% and hence the costliest treatments. Current tools for predicting recurrence use scoring systems that overestimate risk and have poor accuracy. Inaccurate and delayed prediction of recurrence significantly elevates the likelihood of mortality. Accurate prediction of recurrence is hence vital for cost-effective management and treatment planning. This is where Machine learning (ML) techniques have emerged as a promising approach for predicting NMIBC recurrence by leveraging molecular and clinical data. This review provides a comprehensive analysis of ML approaches for predicting NMIBC recurrence. Our systematic evaluation demonstrates the potential of diverse ML algorithms and markers, including radiomic, clinical, histopathological, genomic, and biochemical data in enhancing recurrence prediction and personalised patient management. We summarise various prediction tasks, data modalities, and ML models used, highlighting their performance, limitations, and future directions of incorporating cost-effectiveness. Challenges related to generalisability and interpretability of artificial intelligent models are discussed, emphasising the need for collaborative efforts and robust datasets.
- [1538] arXiv:2403.10588 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata, and DocumentSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: The understanding of large-scale scientific software poses significant challenges due to its diverse codebase, extensive code length, and target computing architectures. The emergence of generative AI, specifically large language models (LLMs), provides novel pathways for understanding such complex scientific codes. This paper presents S3LLM, an LLM-based framework designed to enable the examination of source code, code metadata, and summarized information in conjunction with textual technical reports in an interactive, conversational manner through a user-friendly interface. S3LLM leverages open-source LLaMA-2 models to enhance code analysis through the automatic transformation of natural language queries into domain-specific language (DSL) queries. Specifically, it translates these queries into Feature Query Language (FQL), enabling efficient scanning and parsing of entire code repositories. In addition, S3LLM is equipped to handle diverse metadata types, including DOT, SQL, and customized formats. Furthermore, S3LLM incorporates retrieval augmented generation (RAG) and LangChain technologies to directly query extensive documents. S3LLM demonstrates the potential of using locally deployed open-source LLMs for the rapid understanding of large-scale scientific computing software, eliminating the need for extensive coding expertise, and thereby making the process more efficient and effective. S3LLM is available at this https URL .
- [1539] arXiv:2403.10596 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Neural Erosion: Emulating Controlled Neurodegeneration and Aging in AI SystemsComments: 19 pages, 6 figures in the main text, 5 figures in the AppendixSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Abstract: Creating controlled methods to simulate neurodegeneration in artificial intelligence (AI) is crucial for applications that emulate brain function decline and cognitive disorders. We use IQ tests performed by Large Language Models (LLMs) and, more specifically, the LLaMA 2 to introduce the concept of ``neural erosion." This deliberate erosion involves ablating synapses or neurons, or adding Gaussian noise during or after training, resulting in a controlled progressive decline in the LLMs' performance. We are able to describe the neurodegeneration in the IQ tests and show that the LLM first loses its mathematical abilities and then its linguistic abilities, while further losing its ability to understand the questions. To the best of our knowledge, this is the first work that models neurodegeneration with text data, compared to other works that operate in the computer vision domain. Finally, we draw similarities between our study and cognitive decline clinical studies involving test subjects. We find that with the application of neurodegenerative methods, LLMs lose abstract thinking abilities, followed by mathematical degradation, and ultimately, a loss in linguistic ability, responding to prompts incoherently. These findings are in accordance with human studies.
- [1540] arXiv:2403.10603 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SurvRNC: Learning Ordered Representations for Survival Prediction using Rank-N-ContrastNuman Saeed , Muhammad Ridzuan , Fadillah Adamsyah Maani , Hussain Alasmawi , Karthik Nandakumar , Mohammad YaqubSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Predicting the likelihood of survival is of paramount importance for individuals diagnosed with cancer as it provides invaluable information regarding prognosis at an early stage. This knowledge enables the formulation of effective treatment plans that lead to improved patient outcomes. In the past few years, deep learning models have provided a feasible solution for assessing medical images, electronic health records, and genomic data to estimate cancer risk scores. However, these models often fall short of their potential because they struggle to learn regression-aware feature representations. In this study, we propose Survival Rank-N Contrast (SurvRNC) method, which introduces a loss function as a regularizer to obtain an ordered representation based on the survival times. This function can handle censored data and can be incorporated into any survival model to ensure that the learned representation is ordinal. The model was extensively evaluated on a HEad \& NeCK TumOR (HECKTOR) segmentation and the outcome-prediction task dataset. We demonstrate that using the SurvRNC method for training can achieve higher performance on different deep survival models. Additionally, it outperforms state-of-the-art methods by 3.6% on the concordance index. The code is publicly available on this https URL
- [1541] arXiv:2403.10618 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Limits of Approximating the Median Treatment EffectSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Econometrics (econ.EM); Methodology (stat.ME)
Abstract: Average Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\mathbf{a}, \mathbf{b}$, much of the prior work focused on estimating median$(\mathbf{a}) -$ median$(\mathbf{b})$, where median($\mathbf x$) denotes the median value in the sorted ordering of all the values in vector $\mathbf x$. It is known that estimating the difference of medians is easier than the desired estimand of median$(\mathbf{a-b})$, called the Median Treatment Effect (MTE). The fundamental problem of causal inference -- for every individual $i$, we can only observe one of the potential outcome values, i.e., either the value $a_i$ or $b_i$, but not both, makes estimating MTE particularly challenging. In this work, we argue that MTE is not estimable and detail a novel notion of approximation that relies on the sorted order of the values in $\mathbf{a-b}$. Next, we identify a quantity called variability that exactly captures the complexity of MTE estimation. By drawing connections to instance-optimality studied in theoretical computer science, we show that every algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability. Finally, we provide a simple linear time algorithm for computing the variability exactly. Unlike much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., take one of $k$ discrete values.
- [1542] arXiv:2403.10667 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and BeyondTianxin Wei , Bowen Jin , Ruirui Li , Hansi Zeng , Zhengyang Wang , Jianhui Sun , Qingyu Yin , Hanqing Lu , Suhang Wang , Jingrui He , Xianfeng TangComments: ICLR 2024Subjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract: Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration. Our daily choices, especially in domains like fashion and retail, are substantially shaped by multi-modal data, such as pictures and textual descriptions. These modalities not only offer intuitive guidance but also cater to personalized user preferences. However, the predominant personalization approaches mainly focus on the ID or text-based recommendation problem, failing to comprehend the information spanning various tasks or modalities. In this paper, our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP), which effectively leverages multi-modal data while eliminating the complexities associated with task- and modality-specific customization. We argue that the advancements in foundational generative modeling have provided the flexibility and effectiveness necessary to achieve the objective. In light of this, we develop a generic and extensible personalization generative framework, that can handle a wide range of personalized needs including item recommendation, product search, preference prediction, explanation generation, and further user-guided image generation. Our methodology enhances the capabilities of foundational language models for personalized tasks by seamlessly ingesting interleaved cross-modal user history information, ensuring a more precise and customized experience for users. To train and evaluate the proposed multi-modal personalized tasks, we also introduce a novel and comprehensive benchmark covering a variety of user requirements. Our experiments on the real-world benchmark showcase the model's potential, outperforming competitive methods specialized for each task.
- [1543] arXiv:2403.10684 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Improved discrete particle swarm optimization using Bee Algorithm and multi-parent crossover method (Case study: Allocation problem and benchmark functions)Comments: 34 pages, 8 figures, 15 tablesSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Compared to other techniques, particle swarm optimization is more frequently utilized because of its ease of use and low variability. However, it is complicated to find the best possible solution in the search space in large-scale optimization problems. Moreover, changing algorithm variables does not influence algorithm convergence much. The PSO algorithm can be combined with other algorithms. It can use their advantages and operators to solve this problem. Therefore, this paper proposes the onlooker multi-parent crossover discrete particle swarm optimization (OMPCDPSO). To improve the efficiency of the DPSO algorithm, we utilized multi-parent crossover on the best solutions. We performed an independent and intensive neighborhood search using the onlooker bees of the bee algorithm. The algorithm uses onlooker bees and crossover. They do local search (exploitation) and global search (exploration). Each of these searches is among the best solutions (employed bees). The proposed algorithm was tested on the allocation problem, which is an NP-hard optimization problem. Also, we used two types of simulated data. They were used to test the scalability and complexity of the better algorithm. Also, fourteen 2D test functions and thirteen 30D test functions were used. They also used twenty IEEE CEC2005 benchmark functions to test the efficiency of OMPCDPSO. Also, to test OMPCDPSO's performance, we compared it to four new binary optimization algorithms and three classic ones. The results show that the OMPCDPSO version had high capability. It performed better than other algorithms. The developed algorithm in this research (OMCDPSO) in 36 test functions out of 47 (76.60%) is better than other algorithms. The Onlooker bees and multi-parent operators significantly impact the algorithm's performance.
- [1544] arXiv:2403.10686 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: AutoHLS: Learning to Accelerate Design Space Exploration for HLS DesignsComments: 5 pages, 6 figures, MWSCAS 2023Subjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: High-level synthesis (HLS) is a design flow that leverages modern language features and flexibility, such as complex data structures, inheritance, templates, etc., to prototype hardware designs rapidly. However, exploring various design space parameters can take much time and effort for hardware engineers to meet specific design specifications. This paper proposes a novel framework called AutoHLS, which integrates a deep neural network (DNN) with Bayesian optimization (BO) to accelerate HLS hardware design optimization. Our tool focuses on HLS pragma exploration and operation transformation. It utilizes integrated DNNs to predict synthesizability within a given FPGA resource budget. We also investigate the potential of emerging quantum neural networks (QNNs) instead of classical DNNs for the AutoHLS pipeline. Our experimental results demonstrate up to a 70-fold speedup in exploration time.
- [1545] arXiv:2403.10691 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language ModelingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
- [1546] arXiv:2403.10692 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EXPLORER: Exploration-guided Reasoning for Textual Reinforcement LearningKinjal Basu , Keerthiram Murugesan , Subhajit Chaudhury , Murray Campbell , Kartik Talamadupula , Tim KlingerSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Abstract: Text-based games (TBGs) have emerged as an important collection of NLP tasks, requiring reinforcement learning (RL) agents to combine natural language understanding with reasoning. A key challenge for agents attempting to solve such tasks is to generalize across multiple games and demonstrate good performance on both seen and unseen objects. Purely deep-RL-based approaches may perform well on seen objects; however, they fail to showcase the same performance on unseen objects. Commonsense-infused deep-RL agents may work better on unseen data; unfortunately, their policies are often not interpretable or easily transferable. To tackle these issues, in this paper, we present EXPLORER which is an exploration-guided reasoning agent for textual reinforcement learning. EXPLORER is neurosymbolic in nature, as it relies on a neural module for exploration and a symbolic module for exploitation. It can also learn generalized symbolic policies and perform well over unseen data. Our experiments show that EXPLORER outperforms the baseline agents on Text-World cooking (TW-Cooking) and Text-World Commonsense (TWC) games.
- [1547] arXiv:2403.10698 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Robust Influence-based Training Methods for Noisy Brain MRISubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Correctly classifying brain tumors is imperative to the prompt and accurate treatment of a patient. While several classification algorithms based on classical image processing or deep learning methods have been proposed to rapidly classify tumors in MR images, most assume the unrealistic setting of noise-free training data. In this work, we study a difficult but realistic setting of training a deep learning model on noisy MR images to classify brain tumors. We propose two training methods that are robust to noisy MRI training data, Influence-based Sample Reweighing (ISR) and Influence-based Sample Perturbation (ISP), which are based on influence functions from robust statistics. Using the influence functions, in ISR, we adaptively reweigh training examples according to how helpful/harmful they are to the training process, while in ISP, we craft and inject helpful perturbation proportional to the influence score. Both ISR and ISP harden the classification model against noisy training data without significantly affecting the generalization ability of the model on test data. We conduct empirical evaluations over a common brain tumor dataset and compare ISR and ISP to three baselines. Our empirical results show that ISR and ISP can efficiently train deep learning models robust against noisy training data.
- [1548] arXiv:2403.10700 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language NavigationFrancesco Taioli , Stefano Rosa , Alberto Castellini , Lorenzo Natale , Alessio Del Bue , Alessandro Farinelli , Marco Cristani , Yiming WangComments: 3 figures, 8 pagesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) is one of the most intuitive yet challenging embodied AI tasks. Agents are tasked to navigate towards a target goal by executing a set of low-level actions, following a series of natural language instructions. All VLN-CE methods in the literature assume that language instructions are exact. However, in practice, instructions given by humans can contain errors when describing a spatial environment due to inaccurate memory or confusion. Current VLN-CE benchmarks do not address this scenario, making the state-of-the-art methods in VLN-CE fragile in the presence of erroneous instructions from human users. For the first time, we propose a novel benchmark dataset that introduces various types of instruction errors considering potential human causes. This benchmark provides valuable insight into the robustness of VLN systems in continuous environments. We observe a noticeable performance drop (up to -25%) in Success Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark. Moreover, we formally define the task of Instruction Error Detection and Localization, and establish an evaluation protocol on top of our benchmark dataset. We also propose an effective method, based on a cross-modal transformer architecture, that achieves the best performance in error detection and localization, compared to baselines. Surprisingly, our proposed method has revealed errors in the validation set of the two commonly used datasets for VLN-CE, i.e., R2R-CE and RxR-CE, demonstrating the utility of our technique in other tasks. Code and dataset will be made available upon acceptance at this https URL
- [1549] arXiv:2403.10704 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: PERL: Parameter Efficient Reinforcement Learning from Human FeedbackHakim Sidahmed , Samrat Phatale , Alex Hutcheson , Zhuonan Lin , Zhang Chen , Zac Yu , Jarvis Jin , Roman Komarytsia , Christiane Ahlheim , Yonghao Zhu , Simral Chaudhary , Bowen Li , Saravanan Ganesh , Bill Byrne , Jessica Hoffmann , Hassan Mansoor , Wei Li , Abhinav Rastogi , Lucas DixonSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Pretrained Large Language Models (LLMs) with human preferences. But training models with RLHF is computationally expensive, and an overall complex process. In this work, we study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al. [2021]. We investigate the setup of "Parameter Efficient Reinforcement Learning" (PERL), in which we perform reward model training and reinforcement learning using LoRA. We compare PERL to conventional fine-tuning (full-tuning) across various configurations for 7 benchmarks, including 2 novel datasets, of reward modeling and reinforcement learning. We find that PERL performs on par with the conventional RLHF setting, while training faster, and with less memory. This enables the high performance of RLHF, while reducing the computational burden that limits its adoption as an alignment technique for Large Language Models. We also release 2 novel thumbs up/down preference datasets: "Taskmaster Coffee", and "Taskmaster Ticketing" to promote research around RLHF.
- [1550] arXiv:2403.10707 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Uncovering Latent Themes of Messaging on Social Media by Integrating LLMs: A Case Study on Climate CampaignsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Abstract: This paper introduces a novel approach to uncovering and analyzing themes in social media messaging. Recognizing the limitations of traditional topic-level analysis, which tends to capture only the overarching patterns, this study emphasizes the need for a finer-grained, theme-focused exploration. Conventional methods of theme discovery, involving manual processes and a human-in-the-loop approach, are valuable but face challenges in scalability, consistency, and resource intensity in terms of time and cost. To address these challenges, we propose a machine-in-the-loop approach that leverages the advanced capabilities of Large Language Models (LLMs). This approach allows for a deeper investigation into the thematic aspects of social media discourse, enabling us to uncover a diverse array of themes, each with unique characteristics and relevance, thereby offering a comprehensive understanding of the nuances present within broader topics. Furthermore, this method efficiently maps the text and the newly discovered themes, enhancing our understanding of the thematic nuances in social media messaging. We employ climate campaigns as a case study and demonstrate that our methodology yields more accurate and interpretable results compared to traditional topic models. Our results not only demonstrate the effectiveness of our approach in uncovering latent themes but also illuminate how these themes are tailored for demographic targeting in social media contexts. Additionally, our work sheds light on the dynamic nature of social media, revealing the shifts in the thematic focus of messaging in response to real-world events.
- [1551] arXiv:2403.10717 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Backdoor Secrets Unveiled: Identifying Backdoor Data with Optimized Scaled Prediction ConsistencyComments: The Twelfth International Conference on Learning Representations (ICLR 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Modern machine learning (ML) systems demand substantial training data, often resorting to external sources. Nevertheless, this practice renders them vulnerable to backdoor poisoning attacks. Prior backdoor defense strategies have primarily focused on the identification of backdoored models or poisoned data characteristics, typically operating under the assumption of access to clean data. In this work, we delve into a relatively underexplored challenge: the automatic identification of backdoor data within a poisoned dataset, all under realistic conditions, i.e., without the need for additional clean data or without manually defining a threshold for backdoor detection. We draw an inspiration from the scaled prediction consistency (SPC) technique, which exploits the prediction invariance of poisoned data to an input scaling factor. Based on this, we pose the backdoor data identification problem as a hierarchical data splitting optimization problem, leveraging a novel SPC-based loss function as the primary optimization objective. Our innovation unfolds in several key aspects. First, we revisit the vanilla SPC method, unveiling its limitations in addressing the proposed backdoor identification problem. Subsequently, we develop a bi-level optimization-based approach to precisely identify backdoor data by minimizing the advanced SPC loss. Finally, we demonstrate the efficacy of our proposal against a spectrum of backdoor attacks, encompassing basic label-corrupted attacks as well as more sophisticated clean-label attacks, evaluated across various benchmark datasets. Experiment results show that our approach often surpasses the performance of current baselines in identifying backdoor data points, resulting in about 4%-36% improvement in average AUROC. Codes are available at this https URL .
- [1552] arXiv:2403.10726 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Strict Partitioning for Sporadic Rigid Gang TasksComments: to be published in IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Abstract: The rigid gang task model is based on the idea of executing multiple threads simultaneously on a fixed number of processors to increase efficiency and performance. Although there is extensive literature on global rigid gang scheduling, partitioned approaches have several practical advantages (e.g., task isolation and reduced scheduling overheads). In this paper, we propose a new partitioned scheduling strategy for rigid gang tasks, named strict partitioning. The method creates disjoint partitions of tasks and processors to avoid inter-partition interference. Moreover, it tries to assign tasks with similar volumes (i.e., parallelisms) to the same partition so that the intra-partition interference can be reduced. Within each partition, the tasks can be scheduled using any type of scheduler, which allows the use of a less pessimistic schedulability test. Extensive synthetic experiments and a case study based on Edge TPU benchmarks show that strict partitioning achieves better schedulability performance than state-of-the-art global gang schedulability analyses for both preemptive and non-preemptive rigid gang task sets.
- [1553] arXiv:2403.10732 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Variance-Dependent Regret Bounds for Non-stationary Linear BanditsComments: 30 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We investigate the non-stationary stochastic linear bandit problem where the reward distribution evolves each round. Existing algorithms characterize the non-stationarity by the total variation budget $B_K$, which is the summation of the change of the consecutive feature vectors of the linear bandits over $K$ rounds. However, such a quantity only measures the non-stationarity with respect to the expectation of the reward distribution, which makes existing algorithms sub-optimal under the general non-stationary distribution setting. In this work, we propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds. Specifically, we introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$. These algorithms address cases where the variance information of the rewards is known and unknown, respectively. Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary stochastic linear bandits under different settings. Experimental evaluations further validate the superior performance of our proposed algorithms over existing works.
- [1554] arXiv:2403.10750 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Depression Detection on Social Media with Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Depression harms. However, due to a lack of mental health awareness and fear of stigma, many patients do not actively seek diagnosis and treatment, leading to detrimental outcomes. Depression detection aims to determine whether an individual suffers from depression by analyzing their history of posts on social media, which can significantly aid in early detection and intervention. It mainly faces two key challenges: 1) it requires professional medical knowledge, and 2) it necessitates both high accuracy and explainability. To address it, we propose a novel depression detection system called DORIS, combining medical knowledge and the recent advances in large language models (LLMs). Specifically, to tackle the first challenge, we proposed an LLM-based solution to first annotate whether high-risk texts meet medical diagnostic criteria. Further, we retrieve texts with high emotional intensity and summarize critical information from the historical mood records of users, so-called mood courses. To tackle the second challenge, we combine LLM and traditional classifiers to integrate medical knowledge-guided features, for which the model can also explain its prediction results, achieving both high accuracy and explainability. Extensive experimental results on benchmarking datasets show that, compared to the current best baseline, our approach improves by 0.036 in AUPRC, which can be considered significant, demonstrating the effectiveness of our approach and its high value as an NLP application.
- [1555] arXiv:2403.10751 (cross-list from cs.IT) [ pdf , ps , html , other ]
-
Title: LIGHTCODE: Light Analytical and Neural Codes for Channels with FeedbackComments: 13 pages, 11 figuresSubjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI)
Abstract: The design of reliable and efficient codes for channels with feedback remains a longstanding challenge in communication theory. While significant improvements have been achieved by leveraging deep learning techniques, neural codes often suffer from high computational costs, a lack of interpretability, and limited practicality in resource-constrained settings. We focus on designing low-complexity coding schemes that are interpretable and more suitable for communication systems. We advance both analytical and neural codes. First, we demonstrate that POWERBLAST, an analytical coding scheme inspired by Schalkwijk-Kailath (SK) and Gallager-Nakiboglu (GN) schemes, achieves notable reliability improvements over both SK and GN schemes, outperforming neural codes in high signal-to-noise ratio (SNR) regions. Next, to enhance reliability in low-SNR regions, we propose LIGHTCODE, a lightweight neural code that achieves state-of-the-art reliability while using a fraction of memory and compute compared to existing deep-learning-based codes. Finally, we systematically analyze the learned codes, establishing connections between LIGHTCODE and POWERBLAST, identifying components crucial for performance, and providing interpretation aided by linear regression analysis.
- [1556] arXiv:2403.10764 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ECRC: Emotion-Causality Recognition in Korean Conversation for GCNComments: 10 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In this multi-task learning study on simultaneous analysis of emotions and their underlying causes in conversational contexts, deep neural network methods were employed to effectively process and train large labeled datasets. However, these approaches are typically limited to conducting context analyses across the entire corpus because they rely on one of the two methods: word- or sentence-level embedding. The former struggles with polysemy and homonyms, whereas the latter causes information loss when processing long sentences. In this study, we overcome the limitations of previous embeddings by utilizing both word- and sentence-level embeddings. Furthermore, we propose the emotion-causality recognition in conversation (ECRC) model, which is based on a novel graph structure, thereby leveraging the strengths of both embedding methods. This model uniquely integrates the bidirectional long short-term memory (Bi-LSTM) and graph neural network (GCN) models for Korean conversation analysis. Compared with models that rely solely on one embedding method, the proposed model effectively structures abstract concepts, such as language features and relationships, thereby minimizing information loss. To assess model performance, we compared the multi-task learning results of three deep neural network models with varying graph structures. Additionally, we evaluated the proposed model using Korean and English datasets. The experimental results show that the proposed model performs better in emotion and causality multi-task learning (74.62% and 75.30%, respectively) when node and edge characteristics are incorporated into the graph structure. Similar results were recorded for the Korean ECC and Wellness datasets (74.62% and 73.44%, respectively) with 71.35% on the IEMOCAP English dataset.
- [1557] arXiv:2403.10776 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: From Melting Pots to Misrepresentations: Exploring Harms in Generative AIComments: In CHI 2024: Generative AI and HCI workshop (GenAICHI 24)Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: With the widespread adoption of advanced generative models such as Gemini and GPT, there has been a notable increase in the incorporation of such models into sociotechnical systems, categorized under AI-as-a-Service (AIaaS). Despite their versatility across diverse sectors, concerns persist regarding discriminatory tendencies within these models, particularly favoring selected `majority' demographics across various sociodemographic dimensions. Despite widespread calls for diversification of media representations, marginalized racial and ethnic groups continue to face persistent distortion, stereotyping, and neglect within the AIaaS context. In this work, we provide a critical summary of the state of research in the context of social harms to lead the conversation to focus on their implications. We also present open-ended research questions, guided by our discussion, to help define future research pathways.
- [1558] arXiv:2403.10780 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Segment Any Object Model (SAOM): Real-to-Simulation Fine-Tuning Strategy for Multi-Class Multi-Instance SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Multi-class multi-instance segmentation is the task of identifying masks for multiple object classes and multiple instances of the same class within an image. The foundational Segment Anything Model (SAM) is designed for promptable multi-class multi-instance segmentation but tends to output part or sub-part masks in the "everything" mode for various real-world applications. Whole object segmentation masks play a crucial role for indoor scene understanding, especially in robotics applications. We propose a new domain invariant Real-to-Simulation (Real-Sim) fine-tuning strategy for SAM. We use object images and ground truth data collected from Ai2Thor simulator during fine-tuning (real-to-sim). To allow our Segment Any Object Model (SAOM) to work in the "everything" mode, we propose the novel nearest neighbour assignment method, updating point embeddings for each ground-truth mask. SAOM is evaluated on our own dataset collected from Ai2Thor simulator. SAOM significantly improves on SAM, with a 28% increase in mIoU and a 25% increase in mAcc for 54 frequently-seen indoor object classes. Moreover, our Real-to-Simulation fine-tuning strategy demonstrates promising generalization performance in real environments without being trained on the real-world data (sim-to-real). The dataset and the code will be released after publication.
- [1559] arXiv:2403.10781 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Exploring Chinese Humor Generation: A Study on Two-Part Allegorical SayingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Humor, a culturally nuanced aspect of human language, poses challenges for computational understanding and generation, especially in Chinese humor, which remains relatively unexplored in the NLP community. This paper investigates the capability of state-of-the-art language models to comprehend and generate Chinese humor, specifically focusing on training them to create allegorical sayings. We employ two prominent training methods: fine-tuning a medium-sized language model and prompting a large one. Our novel fine-tuning approach incorporates fused Pinyin embeddings to consider homophones and employs contrastive learning with synthetic hard negatives to distinguish humor elements. Human-annotated results show that these models can generate humorous allegorical sayings, with prompting proving to be a practical and effective method. However, there is still room for improvement in generating allegorical sayings that match human creativity.
- [1560] arXiv:2403.10787 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Time Series Representation Learning with Supervised Contrastive Temporal TransformerComments: 8 pages, 8 figures, IJCNN 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Finding effective representations for time series data is a useful but challenging task. Several works utilize self-supervised or unsupervised learning methods to address this. However, there still remains the open question of how to leverage available label information for better representations. To answer this question, we exploit pre-existing techniques in time series and representation learning domains and develop a simple, yet novel fusion model, called: \textbf{S}upervised \textbf{CO}ntrastive \textbf{T}emporal \textbf{T}ransformer (SCOTT). We first investigate suitable augmentation methods for various types of time series data to assist with learning change-invariant representations. Secondly, we combine Transformer and Temporal Convolutional Networks in a simple way to efficiently learn both global and local features. Finally, we simplify Supervised Contrastive Loss for representation learning of labelled time series data. We preliminarily evaluate SCOTT on a downstream task, Time Series Classification, using 45 datasets from the UCR archive. The results show that with the representations learnt by SCOTT, even a weak classifier can perform similar to or better than existing state-of-the-art models (best performance on 23/45 datasets and highest rank against 9 baseline models). Afterwards, we investigate SCOTT's ability to address a real-world task, online Change Point Detection (CPD), on two datasets: a human activity dataset and a surgical patient dataset. We show that the model performs with high reliability and efficiency on the online CPD problem ($\sim$98\% and $\sim$97\% area under precision-recall curve respectively). Furthermore, we demonstrate the model's potential in tackling early detection and show it performs best compared to other candidates.
- [1561] arXiv:2403.10795 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From Words to Routes: Applying Large Language Models to Vehicle RoutingComments: Submitted to IEEE Robotics and Automation Society (IROS 2024)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: LLMs have shown impressive progress in robotics (e.g., manipulation and navigation) with natural language task descriptions. The success of LLMs in these tasks leads us to wonder: What is the ability of LLMs to solve vehicle routing problems (VRPs) with natural language task descriptions? In this work, we study this question in three steps. First, we construct a dataset with 21 types of single- or multi-vehicle routing problems. Second, we evaluate the performance of LLMs across four basic prompt paradigms of text-to-code generation, each involving different types of text input. We find that the basic prompt paradigm, which generates code directly from natural language task descriptions, performs the best for GPT-4, achieving 56% feasibility, 40% optimality, and 53% efficiency. Third, based on the observation that LLMs may not be able to provide correct solutions at the initial attempt, we propose a framework that enables LLMs to refine solutions through self-reflection, including self-debugging and self-verification. With GPT-4, our proposed framework achieves a 16% increase in feasibility, a 7% increase in optimality, and a 15% increase in efficiency. Moreover, we examine the sensitivity of GPT-4 to task descriptions, specifically focusing on how its performance changes when certain details are omitted from the task descriptions, yet the core meaning is preserved. Our findings reveal that such omissions lead to a notable decrease in performance: 4% in feasibility, 4% in optimality, and 5% in efficiency. Website: this https URL
- [1562] arXiv:2403.10799 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Efficient Pruning of Large Language Model with Adaptive Estimation FusionJun Liu , Chao Wu , Changdi Yang , Hao Tang , Haoye Dong , Zhenglun Kong , Geng Yuan , Wei Niu , Dong Huang , Yanzhi WangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have become crucial for many generative downstream tasks, leading to an inevitable trend and significant challenge to deploy them efficiently on resource-constrained devices. Structured pruning is a widely used method to address this challenge. However, when dealing with the complex structure of the multiple decoder layers, general methods often employ common estimation approaches for pruning. These approaches lead to a decline in accuracy for specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure. Meanwhile, it can adaptively fuse coarse-grained and finegrained estimations based on the results from complex and multilayer structures. All aspects of our design seamlessly integrate into the endto-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B,Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.
- [1563] arXiv:2403.10803 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Enhancing Out-of-Distribution Detection with Multitesting-based Layer-wise Feature FusionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Deploying machine learning in open environments presents the challenge of encountering diverse test inputs that differ significantly from the training data. These out-of-distribution samples may exhibit shifts in local or global features compared to the training distribution. The machine learning (ML) community has responded with a number of methods aimed at distinguishing anomalous inputs from original training data. However, the majority of previous studies have primarily focused on the output layer or penultimate layer of pre-trained deep neural networks. In this paper, we propose a novel framework, Multitesting-based Layer-wise Out-of-Distribution (OOD) Detection (MLOD), to identify distributional shifts in test samples at different levels of features through rigorous multiple testing procedure. Our approach distinguishes itself from existing methods as it does not require modifying the structure or fine-tuning of the pre-trained classifier. Through extensive experiments, we demonstrate that our proposed framework can seamlessly integrate with any existing distance-based inspection method while efficiently utilizing feature extractors of varying depths. Our scheme effectively enhances the performance of out-of-distribution detection when compared to baseline methods. In particular, MLOD-Fisher achieves superior performance in general. When trained using KNN on CIFAR10, MLOD-Fisher significantly lowers the false positive rate (FPR) from 24.09% to 7.47% on average compared to merely utilizing the features of the last layer.
- [1564] arXiv:2403.10805 (cross-list from cs.SD) [ pdf , ps , other ]
-
Title: Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature InferenceFan Zhang , Zhaohan Wang , Xin Lyu , Siyuan Zhao , Mengjian Li , Weidong Geng , Naye Ji , Hui Du , Fuxing Gao , Hao Wu , Shunman LiComments: 12 pages,Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Abstract: Speech-driven gesture generation is an emerging field within virtual human creation. However, a significant challenge lies in accurately determining and processing the multitude of input features (such as acoustic, semantic, emotional, personality, and even subtle unknown features). Traditional approaches, reliant on various explicit feature inputs and complex multimodal processing, constrain the expressiveness of resulting gestures and limit their applicability. To address these challenges, we present Persona-Gestor, a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures solely relying on raw speech audio. The model combines a fuzzy feature extractor and a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture. The fuzzy feature extractor harnesses a fuzzy inference strategy that automatically infers implicit, continuous fuzzy features. These fuzzy features, represented as a unified latent feature, are fed into the AdaLN transformer. The AdaLN transformer introduces a conditional mechanism that applies a uniform function across all tokens, thereby effectively modeling the correlation between the fuzzy features and the gesture sequence. This module ensures a high level of gesture-speech synchronization while preserving naturalness. Finally, we employ the diffusion model to train and infer various gestures. Extensive subjective and objective evaluations on the Trinity, ZEGGS, and BEAT datasets confirm our model's superior performance to the current state-of-the-art approaches. Persona-Gestor improves the system's usability and generalization capabilities, setting a new benchmark in speech-driven gesture synthesis and broadening the horizon for virtual human technology. Supplementary videos and code can be accessed at this https URL
- [1565] arXiv:2403.10819 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Incentivized Exploration of Non-Stationary Stochastic BanditsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two different non-stationary environments: abruptly-changing and continuously-changing, and propose respective incentivized exploration algorithms. We show that the proposed algorithms achieve sublinear regret and compensation over time, thus effectively incentivizing exploration despite the nonstationarity and the biased or drifted feedback.
- [1566] arXiv:2403.10823 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image AnalysisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generalist foundation model has ushered in newfound capabilities in medical domain. However, the contradiction between the growing demand for high-quality annotated data with patient privacy continues to intensify. The utilization of medical artificial intelligence generated content (Med-AIGC) as an inexhaustible resource repository arises as a potential solution to address the aforementioned challenge. Here we harness 1 million open-source synthetic fundus images paired with natural language descriptions, to curate an ethical language-image foundation model for retina image analysis named VisionCLIP. VisionCLIP achieves competitive performance on three external datasets compared with the existing method pre-trained on real-world data in a zero-shot fashion. The employment of artificially synthetic images alongside corresponding textual data for training enables the medical foundation model to successfully assimilate knowledge of disease symptomatology, thereby circumventing potential breaches of patient confidentiality.
- [1567] arXiv:2403.10824 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: LookALike: Human Mimicry based collaborative decision makingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Artificial General Intelligence falls short when communicating role specific nuances to other systems. This is more pronounced when building autonomous LLM agents capable and designed to communicate with each other for real world problem solving. Humans can communicate context and domain specific nuances along with knowledge, and that has led to refinement of skills. In this work we propose and evaluate a novel method that leads to knowledge distillation among LLM agents leading to realtime human role play preserving unique contexts without relying on any stored data or pretraining. We also evaluate how our system performs better in simulated real world tasks compared to state of the art.
- [1568] arXiv:2403.10834 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SF(DA)$^2$: Source-free Domain Adaptation Through the Lens of Data AugmentationComments: ICLR 2024. Code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In the face of the deep learning model's vulnerability to domain shift, source-free domain adaptation (SFDA) methods have been proposed to adapt models to new, unseen target domains without requiring access to source domain data. Although the potential benefits of applying data augmentation to SFDA are attractive, several challenges arise such as the dependence on prior knowledge of class-preserving transformations and the increase in memory and computational requirements. In this paper, we propose Source-free Domain Adaptation Through the Lens of Data Augmentation (SF(DA)$^2$), a novel approach that leverages the benefits of data augmentation without suffering from these challenges. We construct an augmentation graph in the feature space of the pretrained model using the neighbor relationships between target features and propose spectral neighborhood clustering to identify partitions in the prediction space. Furthermore, we propose implicit feature augmentation and feature disentanglement as regularization loss functions that effectively utilize class semantic information within the feature space. These regularizers simulate the inclusion of an unlimited number of augmented target features into the augmentation graph while minimizing computational and memory demands. Our method shows superior adaptation performance in SFDA scenarios, including 2D image and 3D point cloud datasets and a highly imbalanced dataset.
- [1569] arXiv:2403.10842 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Twin Transformer using Gated Dynamic Learnable Attention mechanism for Fault Detection and Diagnosis in the Tennessee Eastman ProcessSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Fault detection and diagnosis (FDD) is a crucial task for ensuring the safety and efficiency of industrial processes. We propose a novel FDD methodology for the Tennessee Eastman Process (TEP), a widely used benchmark for chemical process control. The model employs two separate Transformer branches, enabling independent processing of input data and potential extraction of diverse information. A novel attention mechanism, Gated Dynamic Learnable Attention (GDLAttention), is introduced which integrates a gating mechanism and dynamic learning capabilities. The gating mechanism modulates the attention weights, allowing the model to focus on the most relevant parts of the input. The dynamic learning approach adapts the attention strategy during training, potentially leading to improved performance. The attention mechanism uses a bilinear similarity function, providing greater flexibility in capturing complex relationships between query and key vectors. In order to assess the effectiveness of our approach, we tested it against 21 and 18 distinct fault scenarios in TEP, and compared its performance with several established FDD techniques. The outcomes indicate that the method outperforms others in terms of accuracy, false alarm rate, and misclassification rate. This underscores the robustness and efficacy of the approach for FDD in intricate industrial processes.
- [1570] arXiv:2403.10850 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: GAgent: An Adaptive Rigid-Soft Gripping Agent with Vision Language Models for Complex Lighting EnvironmentsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces GAgent: an Gripping Agent designed for open-world environments that provides advanced cognitive abilities via VLM agents and flexible grasping abilities with variable stiffness soft grippers. GAgent comprises three primary components - Prompt Engineer module, Visual-Language Model (VLM) core and Workflow module. These three modules enhance gripper success rates by recognizing objects and materials and accurately estimating grasp area even under challenging lighting conditions. As part of creativity, researchers also created a bionic hybrid soft gripper with variable stiffness capable of gripping heavy loads while still gently engaging objects. This intelligent agent, featuring VLM-based cognitive processing with bionic design, shows promise as it could potentially benefit UAVs in various scenarios.
- [1571] arXiv:2403.10853 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Just Say the Name: Online Continual Learning with Category Names Only via Data GenerationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In real-world scenarios, extensive manual annotation for continual learning is impractical due to prohibitive costs. Although prior arts, influenced by large-scale webly supervised training, suggest leveraging web-scraped data in continual learning, this poses challenges such as data imbalance, usage restrictions, and privacy concerns. Addressing the risks of continual webly supervised training, we present an online continual learning framework - Generative Name only Continual Learning (G-NoCL). The proposed G-NoCL uses a set of generators G along with the learner. When encountering new concepts (i.e., classes), G-NoCL employs the novel sample complexity-guided data ensembling technique DIverSity and COmplexity enhancing ensemBlER (DISCOBER) to optimally sample training data from generated data. Through extensive experimentation, we demonstrate superior performance of DISCOBER in G-NoCL online CL benchmarks, covering both In-Distribution (ID) and Out-of-Distribution (OOD) generalization evaluations, compared to naive generator-ensembling, web-supervised, and manually annotated data.
- [1572] arXiv:2403.10860 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Efficient Domain Adaptation for Endoscopic Visual OdometrySubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Visual odometry plays a crucial role in endoscopic imaging, yet the scarcity of realistic images with ground truth poses poses a significant challenge. Therefore, domain adaptation offers a promising approach to bridge the pre-operative planning domain with the intra-operative real domain for learning odometry information. However, existing methodologies suffer from inefficiencies in the training time. In this work, an efficient neural style transfer framework for endoscopic visual odometry is proposed, which compresses the time from pre-operative planning to testing phase to less than five minutes. For efficient traing, this work focuses on training modules with only a limited number of real images and we exploit pre-operative prior information to dramatically reduce training duration. Moreover, during the testing phase, we propose a novel Test Time Adaptation (TTA) method to mitigate the gap in lighting conditions between training and testing datasets. Experimental evaluations conducted on two public endoscope datasets showcase that our method achieves state-of-the-art accuracy in visual odometry tasks while boasting the fastest training speeds. These results demonstrate significant promise for intra-operative surgery applications.
- [1573] arXiv:2403.10863 (cross-list from q-bio.GN) [ pdf , ps , html , other ]
-
Title: stMCDI: Masked Conditional Diffusion Model with Graph Neural Network for Spatial Transcriptomics Data ImputationComments: Submitted to IJCAI2024Subjects: Genomics (q-bio.GN) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Spatially resolved transcriptomics represents a significant advancement in single-cell analysis by offering both gene expression data and their corresponding physical locations. However, this high degree of spatial resolution entails a drawback, as the resulting spatial transcriptomic data at the cellular level is notably plagued by a high incidence of missing values. Furthermore, most existing imputation methods either overlook the spatial information between spots or compromise the overall gene expression data distribution. To address these challenges, our primary focus is on effectively utilizing the spatial location information within spatial transcriptomic data to impute missing values, while preserving the overall data distribution. We introduce \textbf{stMCDI}, a novel conditional diffusion model for spatial transcriptomics data imputation, which employs a denoising network trained using randomly masked data portions as guidance, with the unmasked data serving as conditions. Additionally, it utilizes a GNN encoder to integrate the spatial position information, thereby enhancing model performance. The results obtained from spatial transcriptomics datasets elucidate the performance of our methods relative to existing approaches.
- [1574] arXiv:2403.10882 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on KoreanChangSu Choi , Yongbin Jeong , Seoyoon Park , InHo Won , HyeonSeok Lim , SangMin Kim , Yejee Kang , Chanhyuk Yoon , Jaewan Park , Yiseul Lee , HyeJin Lee , Younggyun Hahm , Hansaem Kim , KyungTae LimSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
- [1575] arXiv:2403.10903 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DTOR: Decision Tree Outlier Regressor to explain anomaliesRiccardo Crupi , Daniele Regoli , Alessandro Damiano Sabatino , Immacolata Marano , Massimiliano Brinis , Luca Albertazzi , Andrea Cirillo , Andrea Claudio CosentiniSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Explaining outliers occurrence and mechanism of their occurrence can be extremely important in a variety of domains. Malfunctions, frauds, threats, in addition to being correctly identified, oftentimes need a valid explanation in order to effectively perform actionable counteracts. The ever more widespread use of sophisticated Machine Learning approach to identify anomalies make such explanations more challenging. We present the Decision Tree Outlier Regressor (DTOR), a technique for producing rule-based explanations for individual data points by estimating anomaly scores generated by an anomaly detection model. This is accomplished by first applying a Decision Tree Regressor, which computes the estimation score, and then extracting the relative path associated with the data point score. Our results demonstrate the robustness of DTOR even in datasets with a large number of features. Additionally, in contrast to other rule-based approaches, the generated rules are consistently satisfied by the points to be explained. Furthermore, our evaluation metrics indicate comparable performance to Anchors in outlier explanation tasks, with reduced execution time.
- [1576] arXiv:2403.10923 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Interpretable Machine Learning for TabPFNDavid Rundel , Julius Kobialka , Constantin von Crailsheim , Matthias Feurer , Thomas Nagler , David RügamerSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
Abstract: The recently developed Prior-Data Fitted Networks (PFNs) have shown very promising results for applications in low-data regimes. The TabPFN model, a special case of PFNs for tabular data, is able to achieve state-of-the-art performance on a variety of classification tasks while producing posterior predictive distributions in mere seconds by in-context learning without the need for learning parameters or hyperparameter tuning. This makes TabPFN a very attractive option for a wide range of domain applications. However, a major drawback of the method is its lack of interpretability. Therefore, we propose several adaptations of popular interpretability methods that we specifically design for TabPFN. By taking advantage of the unique properties of the model, our adaptations allow for more efficient computations than existing implementations. In particular, we show how in-context learning facilitates the estimation of Shapley values by avoiding approximate retraining and enables the use of Leave-One-Covariate-Out (LOCO) even when working with large-scale Transformers. In addition, we demonstrate how data valuation methods can be used to address scalability challenges of TabPFN. Our proposed methods are implemented in a package tabpfn_iml and made available at this https URL .
- [1577] arXiv:2403.10944 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Human Centered AI for Indian Legal Text AnalyticsComments: 7 pages, 7 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Legal research is a crucial task in the practice of law. It requires intense human effort and intellectual prudence to research a legal case and prepare arguments. Recent boom in generative AI has not translated to proportionate rise in impactful legal applications, because of low trustworthiness and and the scarcity of specialized datasets for training Large Language Models (LLMs). This position paper explores the potential of LLMs within Legal Text Analytics (LTA), highlighting specific areas where the integration of human expertise can significantly enhance their performance to match that of experts. We introduce a novel dataset and describe a human centered, compound AI system that principally incorporates human inputs for performing LTA tasks with LLMs.
- [1578] arXiv:2403.10949 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: SelfIE: Self-Interpretation of Large Language Model EmbeddingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions on hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control that erases harmful knowledge in LLM without supervision targets.
- [1579] arXiv:2403.10967 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot GeneralizationComments: 33 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Zero-shot generalization (ZSG) to unseen dynamics is a major challenge for creating generally capable embodied agents. To address the broader challenge, we start with the simpler setting of contextual reinforcement learning (cRL), assuming observability of the context values that parameterize the variation in the system's dynamics, such as the mass or dimensions of a robot, without making further simplifying assumptions about the observability of the Markovian state. Toward the goal of ZSG to unseen variation in context, we propose the contextual recurrent state-space model (cRSSM), which introduces changes to the world model of the Dreamer (v3) (Hafner et al., 2023). This allows the world model to incorporate context for inferring latent Markovian states from the observations and modeling the latent dynamics. Our experiments show that such systematic incorporation of the context improves the ZSG of the policies trained on the ``dreams'' of the world model. We further find qualitatively that our approach allows Dreamer to disentangle the latent state from context, allowing it to extrapolate its dreams to the many worlds of unseen contexts. The code for all our experiments is available at \url{ this https URL }.
- [1580] arXiv:2403.10968 (cross-list from cs.CR) [ pdf , ps , other ]
-
Title: Enhancing IoT Security Against DDoS Attacks through Federated LearningSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The rapid proliferation of the Internet of Things (IoT) has ushered in transformative connectivity between physical devices and the digital realm. Nonetheless, the escalating threat of Distributed Denial of Service (DDoS) attacks jeopardizes the integrity and reliability of IoT networks. Conventional DDoS mitigation approaches are ill-equipped to handle the intricacies of IoT ecosystems, potentially compromising data privacy. This paper introduces an innovative strategy to bolster the security of IoT networks against DDoS attacks by harnessing the power of Federated Learning that allows multiple IoT devices or edge nodes to collaboratively build a global model while preserving data privacy and minimizing communication overhead. The research aims to investigate Federated Learning's effectiveness in detecting and mitigating DDoS attacks in IoT. Our proposed framework leverages IoT devices' collective intelligence for real-time attack detection without compromising sensitive data. This study proposes innovative deep autoencoder approaches for data dimensionality reduction, retraining, and partial selection to enhance the performance and stability of the proposed model. Additionally, two renowned aggregation algorithms, FedAvg and FedAvgM, are employed in this research. Various metrics, including true positive rate, false positive rate, and F1-score, are employed to evaluate the model. The dataset utilized in this research, N-BaIoT, exhibits non-IID data distribution, where data categories are distributed quite differently. The negative impact of these distribution disparities is managed by employing retraining and partial selection techniques, enhancing the final model's stability. Furthermore, evaluation results demonstrate that the FedAvgM aggregation algorithm outperforms FedAvg, indicating that in non-IID datasets, FedAvgM provides better stability and performance.
- [1581] arXiv:2403.10984 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: IoTCO2: Assessing the End-To-End Carbon Footprint of Internet-of-Things-Enabled Deep LearningComments: 5 figures, 8 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: To improve privacy and ensure quality-of-service (QoS), deep learning (DL) models are increasingly deployed on Internet of Things (IoT) devices for data processing, significantly increasing the carbon footprint associated with DL on IoT, covering both operational and embodied aspects. Existing operational energy predictors often overlook quantized DL models and emerging neural processing units (NPUs), while embodied carbon footprint modeling tools neglect non-computing hardware components common in IoT devices, creating a gap in accurate carbon footprint modeling tools for IoT-enabled DL. This paper introduces \textit{\carb}, an end-to-end modeling tool for precise carbon footprint estimation in IoT-enabled DL, demonstrating a maximum $\pm21\%$ deviation in carbon footprint values compared to actual measurements across various DL models. Additionally, practical applications of \carb are showcased through multiple user case studies.
- [1582] arXiv:2403.10988 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Boosting Flow-based Generative Super-Resolution Models via Learned PriorComments: Accepted to CVPR2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However, these methods encounter several challenges during image generation, such as grid artifacts, exploding inverses, and suboptimal results due to a fixed sampling temperature. To overcome these issues, this work introduces a conditional learned prior to the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image, which is then transformed by the flow model into an SR image. Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: this https URL
- [1583] arXiv:2403.10995 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Edge Private Graph Neural Networks with Singular Value PerturbationComments: Accepted at Privacy Enhancing Technologies Symposium (PETS) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Abstract: Graph neural networks (GNNs) play a key role in learning representations from graph-structured data and are demonstrated to be useful in many applications. However, the GNN training pipeline has been shown to be vulnerable to node feature leakage and edge extraction attacks. This paper investigates a scenario where an attacker aims to recover private edge information from a trained GNN model. Previous studies have employed differential privacy (DP) to add noise directly to the adjacency matrix or a compact graph representation. The added perturbations cause the graph structure to be substantially morphed, reducing the model utility. We propose a new privacy-preserving GNN training algorithm, Eclipse, that maintains good model utility while providing strong privacy protection on edges. Eclipse is based on two key observations. First, adjacency matrices in graph structures exhibit low-rank behavior. Thus, Eclipse trains GNNs with a low-rank format of the graph via singular values decomposition (SVD), rather than the original graph. Using the low-rank format, Eclipse preserves the primary graph topology and removes the remaining residual edges. Eclipse adds noise to the low-rank singular values instead of the entire graph, thereby preserving the graph privacy while still maintaining enough of the graph structure to maintain model utility. We theoretically show Eclipse provide formal DP guarantee on edges. Experiments on benchmark graph datasets show that Eclipse achieves significantly better privacy-utility tradeoff compared to existing privacy-preserving GNN training methods. In particular, under strong privacy constraints ($\epsilon$ < 4), Eclipse shows significant gains in the model utility by up to 46%. We further demonstrate that Eclipse also has better resilience against common edge attacks (e.g., LPA), lowering the attack AUC by up to 5% compared to other state-of-the-art baselines.
- [1584] arXiv:2403.10997 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: N2F2: Hierarchical Scene Understanding with Nested Neural Feature FieldsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.
- [1585] arXiv:2403.11009 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related LanguagesFahim Faisal , Orevaoghene Ahia , Aarohi Srivastava , Kabir Ahuja , David Chiang , Yulia Tsvetkov , Antonios AnastasopoulosComments: Equal contribution: Fahim Faisal, Orevaoghene AhiaSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and one step towards advancing it further. Code/data: this https URL
- [1586] arXiv:2403.11015 (cross-list from q-bio.MN) [ pdf , ps , other ]
-
Title: Identifying the Attractors of Gene Regulatory Networks from Expression Data under Uncertainty: An Interpretable ApproachSubjects: Molecular Networks (q-bio.MN) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: In systems biology, attractor landscape analysis of gene regulatory networks is recognized as a powerful computational tool for studying various cellular states from proliferation and differentiation to senescence and apoptosis. Therefore, accurate identification of attractors plays a critical role in determination of the cell fates. On the other hand, in a real biological circuit, genetic/epigenetic alterations as well as varying environmental factors drastically take effect on the location, characteristics, and even the number of attractors. The central question is: Given a temporal gene expression profile of a real gene regulatory network, how can the attractors be robustly identified in the presence of huge amount of uncertainty? This paper addresses this question using a novel approach based on Zadeh Computing with Words. The proposed scheme could effectively identify the attractors from temporal gene expression data in terms of both fuzzy logic-based and linguistic descriptions which are simply interpretable by human experts. Therefore, this method can be considered as an effective step towards interpretable artificial intelligence. Without loss of generality, genetic toggle switch is considered as the case study. The nonlinear dynamics of this benchmark gene regulatory network is computationally modeled by the notion of uncertain stochastic differential equations. The results of in-silico study demonstrate the efficiency and robustness of the proposed method.
- [1587] arXiv:2403.11021 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Neuro-Symbolic Video SearchSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
- [1588] arXiv:2403.11027 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Reward Guided Latent Consistency DistillationComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss.
As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM. - [1589] arXiv:2403.11046 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Regulating Chatbot Output via Inter-Informational CompetitionComments: 20,000-word legal Article, forthcoming in Northwestern Journal of Technology and Intellectual PropertySubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract: The advent of ChatGPT has sparked over a year of regulatory frenzy. However, few existing studies have rigorously questioned the assumption that, if left unregulated, AI chatbot's output would inflict tangible, severe real harm on human affairs. Most researchers have overlooked the critical possibility that the information market itself can effectively mitigate these risks and, as a result, they tend to use regulatory tools to address the issue directly. This Article develops a yardstick for reevaluating both AI-related content risks and corresponding regulatory proposals by focusing on inter-informational competition among various outlets. The decades-long history of regulating information and communications technologies indicates that regulators tend to err too much on the side of caution and to put forward excessive regulatory measures when encountering the uncertainties brought about by new technologies. In fact, a trove of empirical evidence has demonstrated that market competition among information outlets can effectively mitigate most risks and that overreliance on regulation is not only unnecessary but detrimental, as well. This Article argues that sufficient competition among chatbots and other information outlets in the information marketplace can sufficiently mitigate and even resolve most content risks posed by generative AI technologies. This renders certain loudly advocated regulatory strategies, like mandatory prohibitions, licensure, curation of datasets, and notice-and-response regimes, truly unnecessary and even toxic to desirable competition and innovation throughout the AI industry. Ultimately, the ideas that I advance in this Article should pour some much-needed cold water on the regulatory frenzy over generative AI and steer the issue back to a rational track.
- [1590] arXiv:2403.11047 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series ForecastingComments: Published at ACM ICAIF 2023Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Abstract: Time series forecasting plays a crucial role in decision-making across various domains, but it presents significant challenges. Recent studies have explored image-driven approaches using computer vision models to address these challenges, often employing lineplots as the visual representation of time series data. In this paper, we propose a novel approach that uses time-frequency spectrograms as the visual representation of time series data. We introduce the use of a vision transformer for multimodal learning, showcasing the advantages of our approach across diverse datasets from different domains. To evaluate its effectiveness, we compare our method against statistical baselines (EMA and ARIMA), a state-of-the-art deep learning-based approach (DeepAR), other visual representations of time series data (lineplot images), and an ablation study on using only the time series as input. Our experiments demonstrate the benefits of utilizing spectrograms as a visual representation for time series data, along with the advantages of employing a vision transformer for simultaneous learning in both the time and frequency domains.
- [1591] arXiv:2403.11073 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Tokensome: Towards a Genetic Vision-Language GPT for Explainable and Cognitive KaryotypingHaoxi Zhang , Xinxu Zhang , Yuanxin Lin , Maiqi Wang , Yi Lai , Yu Wang , Linfeng Yu , Yufeng Xu , Ran Cheng , Edward SzczerbickiComments: Preprint. Work in progressSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Automatic karyotype analysis is often defined as a visual perception task focused solely on chromosomal object-level modeling. This definition has led most existing methods to overlook componential and holistic information, significantly constraining model performance. Moreover, the lack of interpretability in current technologies hinders clinical adoption. In this paper, we introduce Tokensome, a novel vision-language model based on chromosome tokenization for explainable and cognitive karyotyping. Tokensome elevates the method from the conventional visual perception layer to the cognitive decision-making layer. This elevation enables the integration of domain knowledge and cognitive reasoning via knowledge graphs and LLMs, markedly enhancing model's explainability and facilitating abnormality detection.
- [1592] arXiv:2403.11074 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Audio-Visual Segmentation via Unlabeled Frame ExploitationComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames, leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frame (NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs, DFs have long temporal distances from the labeled frame, which share semantic-similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
- [1593] arXiv:2403.11075 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: GOMA: Proactive Embodied Cooperative Communication via Goal-Oriented Mental AlignmentComments: 8 pages, 5 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Verbal communication plays a crucial role in human cooperation, particularly when the partners only have incomplete information about the task, environment, and each other's mental state. In this paper, we propose a novel cooperative communication framework, Goal-Oriented Mental Alignment (GOMA). GOMA formulates verbal communication as a planning problem that minimizes the misalignment between the parts of agents' mental states that are relevant to the goals. This approach enables an embodied assistant to reason about when and how to proactively initialize communication with humans verbally using natural language to help achieve better cooperation. We evaluate our approach against strong baselines in two challenging environments, Overcooked (a multiplayer game) and VirtualHome (a household simulator). Our experimental results demonstrate that large language models struggle with generating meaningful communication that is grounded in the social and physical context. In contrast, our approach can successfully generate concise verbal communication for the embodied assistant to effectively boost the performance of the cooperation as well as human users' perception of the assistant.
- [1594] arXiv:2403.11082 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive LearningComments: Accepted at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL Findings) 2024. [ this https URL ]Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, Our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51\% to 38.81\%). The framework also yields improvements of 1.59\% and 0.23\% in semantic textual similarity tasks and various transfer tasks, respectively.
- [1595] arXiv:2403.11092 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual ConceptsComments: NAACL 2024 Main ConferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.
- [1596] arXiv:2403.11106 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Self-Supervised Quantization-Aware Knowledge DistillationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: this https URL .
- [1597] arXiv:2403.11114 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Phasic Diversity Optimization for Population-Based Reinforcement LearningComments: 7 pages, 4 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Reviewing the previous work of diversity Rein-forcement Learning,diversity is often obtained via an augmented loss function,which requires a balance between reward and diversity.Generally,diversity optimization algorithms use Multi-armed Bandits algorithms to select the coefficient in the pre-defined space. However, the dynamic distribution of reward signals for MABs or the conflict between quality and diversity limits the performance of these methods. We introduce the Phasic Diversity Optimization (PDO) algorithm, a Population-Based Training framework that separates reward and diversity training into distinct phases instead of optimizing a multi-objective function. In the auxiliary phase, agents with poor performance diversified via determinants will not replace the better agents in the archive. The decoupling of reward and diversity allows us to use an aggressive diversity optimization in the auxiliary phase without performance degradation. Furthermore, we construct a dogfight scenario for aerial agents to demonstrate the practicality of the PDO algorithm. We introduce two implementations of PDO archive and conduct tests in the newly proposed adversarial dogfight and MuJoCo simulations. The results show that our proposed algorithm achieves better performance than baselines.
- [1598] arXiv:2403.11116 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: PhD: A Prompted Visual Hallucination Evaluation DatasetJiazhen Liu , Yuhan Fu , Ruobing Xie , Runquan Xie , Xingwu Sun , Fengzong Lian , Zhanhui Kang , Xirong LiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The rapid growth of Large Language Models (LLMs) has driven the development of Large Vision-Language Models (LVLMs). The challenge of hallucination, prevalent in LLMs, also emerges in LVLMs. However, most existing efforts mainly focus on object hallucination in LVLM, ignoring diverse types of LVLM hallucinations. In this study, we delve into the Intrinsic Vision-Language Hallucination (IVL-Hallu) issue, thoroughly analyzing different types of IVL-Hallu on their causes and reflections. Specifically, we propose several novel IVL-Hallu tasks and categorize them into four types: (a) object hallucination, which arises from the misidentification of objects, (b) attribute hallucination, which is caused by the misidentification of attributes, (c) multi-modal conflicting hallucination, which derives from the contradictions between textual and visual information, and (d) counter-common-sense hallucination, which owes to the contradictions between the LVLM knowledge and actual images. Based on these taxonomies, we propose a more challenging benchmark named PhD to evaluate and explore IVL-Hallu. An automated pipeline is proposed for generating different types of IVL-Hallu data. Extensive experiments on five SOTA LVLMs reveal their inability to effectively tackle our proposed IVL-Hallu tasks, with detailed analyses and insights on the origins and possible solutions of these new challenging IVL-Hallu tasks, facilitating future researches on IVL-Hallu and LVLM. The benchmark can be accessed at this https URL
- [1599] arXiv:2403.11124 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Scaling Data Diversity for Fine-Tuning Language Models in Human AlignmentComments: Accepted by LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Alignment with human preference prevents large language models (LLMs) from generating misleading or toxic content while requiring high-cost human feedback. Assuming resources of human annotation are limited, there are two different ways of allocating considered: more diverse PROMPTS or more diverse RESPONSES to be labeled. Nonetheless, a straightforward comparison between their impact is absent. In this work, we first control the diversity of both sides according to the number of samples for fine-tuning, which can directly reflect their influence. We find that instead of numerous prompts, more responses but fewer prompts better trigger LLMs for human alignment. Additionally, the concept of diversity for prompts can be more complex than responses that are typically quantified by single digits. Consequently, a new formulation of prompt diversity is proposed, further implying a linear correlation with the final performance of LLMs after fine-tuning. We also leverage it on data augmentation and conduct experiments to show its effect on different algorithms.
- [1600] arXiv:2403.11152 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluation Ethics of LLMs in Legal DomainComments: 10 pages, in processing of ACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, the utilization of large language models for natural language dialogue has gained momentum, leading to their widespread adoption across various domains. However, their universal competence in addressing challenges specific to specialized fields such as law remains a subject of scrutiny. The incorporation of legal ethics into the model has been overlooked by researchers. We asserts that rigorous ethic evaluation is essential to ensure the effective integration of large language models in legal domains, emphasizing the need to assess domain-specific proficiency and domain-specific ethic. To address this, we propose a novelty evaluation methodology, utilizing authentic legal cases to evaluate the fundamental language abilities, specialized legal knowledge and legal robustness of large language models (LLMs). The findings from our comprehensive evaluation contribute significantly to the academic discourse surrounding the suitability and performance of large language models in legal domains.
- [1601] arXiv:2403.11162 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient InversionComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract: Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot generation where a pretrained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success, concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response, we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pretrained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image, which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication, surpassing alternative validation techniques. Code implementation is available at this https URL .
- [1602] arXiv:2403.11169 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Correcting misinformation on social media with a large language modelComments: 53 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Real-world misinformation can be partially correct and even factual but misleading. It undermines public trust in science and democracy, particularly on social media, where it can spread rapidly. High-quality and timely correction of misinformation that identifies and explains its (in)accuracies has been shown to effectively reduce false beliefs. Despite the wide acceptance of manual correction, it is difficult to be timely and scalable, a concern as technologies like large language models (LLMs) make misinformation easier to produce. LLMs also have versatile capabilities that could accelerate misinformation correction-however, they struggle due to a lack of recent information, a tendency to produce false content, and limitations in addressing multimodal information. We propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving evidence as refutations or contexts, MUSE identifies and explains (in)accuracies in a piece of content-not presupposed to be misinformation-with references. It also describes images and conducts multimodal searches to verify and correct multimodal content. Fact-checking experts evaluate responses to social media content that are not presupposed to be (non-)misinformation but broadly include incorrect, partially correct, and correct posts, that may or may not be misleading. We propose and evaluate 13 dimensions of misinformation correction quality, ranging from the accuracy of identifications and factuality of explanations to the relevance and credibility of references. The results demonstrate MUSE's ability to promptly write high-quality responses to potential misinformation on social media-overall, MUSE outperforms GPT-4 by 37% and even high-quality responses from laypeople by 29%. This work reveals LLMs' potential to help combat real-world misinformation effectively and efficiently.
- [1603] arXiv:2403.11175 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Prior-dependent analysis of posterior sampling reinforcement learning with function approximationComments: Published in the 27th International Conference on Artificial Intelligence and Statistics (AISTATS)Subjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Abstract: This work advances randomized exploration in reinforcement learning (RL) with function approximation modeled by linear mixture MDPs. We establish the first prior-dependent Bayesian regret bound for RL with function approximation; and refine the Bayesian regret analysis for posterior sampling reinforcement learning (PSRL), presenting an upper bound of ${\mathcal{O}}(d\sqrt{H^3 T \log T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This signifies a methodological enhancement by optimizing the $\mathcal{O}(\sqrt{\log T})$ factor over the previous benchmark (Osband and Van Roy, 2014) specified to linear mixture MDPs. Our approach, leveraging a value-targeted model learning perspective, introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses reliant on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.
- [1604] arXiv:2403.11199 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Graph Unitary Message PassingComments: 15 pages, 3 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Message passing mechanism contributes to the success of GNNs in various applications, but also brings the oversquashing problem. Recent works combat oversquashing by improving the graph spectrums with rewiring techniques, disrupting the structural bias in graphs, and having limited improvement on oversquashing in terms of oversquashing measure. Motivated by unitary RNN, we propose Graph Unitary Message Passing (GUMP) to alleviate oversquashing in GNNs by applying unitary adjacency matrix for message passing. To design GUMP, a transformation is first proposed to make general graphs have unitary adjacency matrix and keep its structural bias. Then, unitary adjacency matrix is obtained with a unitary projection algorithm, which is implemented by utilizing the intrinsic structure of unitary adjacency matrix and allows GUMP to be permutation-equivariant. Experimental results show the effectiveness of GUMP in improving the performance on various graph learning tasks.
- [1605] arXiv:2403.11202 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation frameworkKaiyan Chang , Kun Wang , Nan Yang , Ying Wang , Dantong Jin , Wenlong Zhu , Zhirong Chen , Cangyuan Li , Hao Yan , Yunhao Zhou , Zhuoliang Zhao , Yuan Cheng , Yudong Pan , Yiqi Liu , Mengdi Wang , Shengwen Liang , yinhe han , Huawei Li , Xiaowei LiComments: Accepted by DAC 2024; please note that this is not the final camera-ready versionSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Abstract: Recent advances in large language models have demonstrated their potential for automated generation of hardware description language (HDL) code from high-level prompts. Researchers have utilized fine-tuning to enhance the ability of these large language models (LLMs) in the field of Chip Design. However, the lack of Verilog data hinders further improvement in the quality of Verilog generation by LLMs. Additionally, the absence of a Verilog and Electronic Design Automation (EDA) script data augmentation framework significantly increases the time required to prepare the training dataset for LLM trainers. This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts. For Verilog generation, it translates Verilog files to an abstract syntax tree and then maps nodes to natural language with a predefined template. For Verilog repair, it uses predefined rules to generate the wrong verilog file and then pairs EDA Tool feedback with the right and wrong verilog file. For EDA Script generation, it uses existing LLM(GPT-3.5) to obtain the description of the Script. To evaluate the effectiveness of our data augmentation method, we finetune Llama2-13B and Llama2-7B models using the dataset generated by our augmentation framework. The results demonstrate a significant improvement in the Verilog generation tasks with LLMs. Moreover, the accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark. Our 13B model (ChipGPT-FT) has a pass rate improvement compared with GPT-3.5 in Verilog generation and outperforms in EDA script (i.e., SiliconCompiler) generation with only 200 EDA script data.
- [1606] arXiv:2403.11204 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Partitioned Neural Network Training via Synthetic Intermediate LabelsComments: 12 pages, 10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: The proliferation of extensive neural network architectures, particularly deep learning models, presents a challenge in terms of resource-intensive training. GPU memory constraints have become a notable bottleneck in training such sizable models. Existing strategies, including data parallelism, model parallelism, pipeline parallelism, and fully sharded data parallelism, offer partial solutions. Model parallelism, in particular, enables the distribution of the entire model across multiple GPUs, yet the ensuing data communication between these partitions slows down training. Additionally, the substantial memory overhead required to store auxiliary parameters on each GPU compounds computational demands. Instead of using the entire model for training, this study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments. These labels, produced through a random process, mitigate memory overhead and computational load. This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy. To validate this method, a 6-layer fully connected neural network is partitioned into two parts and its performance is assessed on the extended MNIST dataset. Experimental results indicate that the proposed approach achieves similar testing accuracies to conventional training methods, while significantly reducing memory and computational requirements. This work contributes to mitigating the resource-intensive nature of training large neural networks, paving the way for more efficient deep learning model development.
- [1607] arXiv:2403.11207 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of DataPaul S. Scotti , Mihir Tripathy , Cesar Kadir Torrico Villanueva , Reese Kneeland , Tong Chen , Ashutosh Narang , Charan Santhirasegaran , Jonathan Xu , Thomas Naselaris , Kenneth A. Norman , Tanishq Mathew AbrahamComments: Code at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Abstract: Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject where each subject requires dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub.
- [1608] arXiv:2403.11220 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: CPA-Enhancer: Chain-of-Thought Prompted Adaptive Enhancer for Object Detection under Unknown DegradationsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Object detection methods under known single degradations have been extensively investigated. However, existing approaches require prior knowledge of the degradation type and train a separate model for each, limiting their practical applications in unpredictable environments. To address this challenge, we propose a chain-of-thought (CoT) prompted adaptive enhancer, CPA-Enhancer, for object detection under unknown degradations. Specifically, CPA-Enhancer progressively adapts its enhancement strategy under the step-by-step guidance of CoT prompts, that encode degradation-related information. To the best of our knowledge, it's the first work that exploits CoT prompting for object detection tasks. Overall, CPA-Enhancer is a plug-and-play enhancement model that can be integrated into any generic detectors to achieve substantial gains on degraded images, without knowing the degradation type priorly. Experimental results demonstrate that CPA-Enhancer not only sets the new state of the art for object detection but also boosts the performance of other downstream vision tasks under unknown degradations.
- [1609] arXiv:2403.11259 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: A learning-based solution approach to the application placement problem in mobile edge computing under uncertaintySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
Abstract: Placing applications in mobile edge computing servers presents a complex challenge involving many servers, users, and their requests. Existing algorithms take a long time to solve high-dimensional problems with significant uncertainty scenarios. Therefore, an efficient approach is required to maximize the quality of service while considering all technical constraints. One of these approaches is machine learning, which emulates optimal solutions for application placement in edge servers. Machine learning models are expected to learn how to allocate user requests to servers based on the spatial positions of users and servers. In this study, the problem is formulated as a two-stage stochastic programming. A sufficient amount of training records is generated by varying parameters such as user locations, their request rates, and solving the optimization model. Then, based on the distance features of each user from the available servers and their request rates, machine learning models generate decision variables for the first stage of the stochastic optimization model, which is the user-to-server request allocation, and are employed as independent decision agents that reliably mimic the optimization model. Support Vector Machines (SVM) and Multi-layer Perceptron (MLP) are used in this research to achieve practical decisions from the stochastic optimization models. The performance of each model has shown an execution effectiveness of over 80%. This research aims to provide a more efficient approach for tackling high-dimensional problems and scenarios with uncertainties in mobile edge computing by leveraging machine learning models for optimal decision-making in request allocation to edge servers. These results suggest that machine-learning models can significantly improve solution times compared to conventional approaches.
- [1610] arXiv:2403.11261 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Lie Group Approach to Riemannian Batch NormalizationComments: Accepted by ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
Abstract: Manifold-valued measurements exist in numerous applications within computer vision and machine learning. Recent studies have extended Deep Neural Networks (DNNs) to manifolds, and concomitantly, normalization techniques have also been adapted to several manifolds, referred to as Riemannian normalization. Nonetheless, most of the existing Riemannian normalization methods have been derived in an ad hoc manner and only apply to specific manifolds. This paper establishes a unified framework for Riemannian Batch Normalization (RBN) techniques on Lie groups. Our framework offers the theoretical guarantee of controlling both the Riemannian mean and variance. Empirically, we focus on Symmetric Positive Definite (SPD) manifolds, which possess three distinct types of Lie group structures. Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups. Specific normalization layers induced by these Lie groups are then proposed for SPD neural networks. We demonstrate the effectiveness of our approach through three sets of experiments: radar recognition, human action recognition, and electroencephalography (EEG) classification. The code is available at this https URL .
- [1611] arXiv:2403.11262 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Understanding Diffusion Models by Feynman's Path IntegralComments: 27 pages, 14 figuresSubjects: Machine Learning (cs.LG) ; Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th)
Abstract: Score-based diffusion models have proven effective in image generation and have gained widespread usage; however, the underlying factors contributing to the performance disparity between stochastic and deterministic (i.e., the probability flow ODEs) sampling schemes remain unclear. We introduce a novel formulation of diffusion models using Feynman's path integral, which is a formulation originally developed for quantum physics. We find this formulation providing comprehensive descriptions of score-based generative models, and demonstrate the derivation of backward stochastic differential equations and loss functions.The formulation accommodates an interpolating parameter connecting stochastic and deterministic sampling schemes, and we identify this parameter as a counterpart of Planck's constant in quantum physics. This analogy enables us to apply the Wentzel-Kramers-Brillouin (WKB) expansion, a well-established technique in quantum physics, for evaluating the negative log-likelihood to assess the performance disparity between stochastic and deterministic sampling schemes.
- [1612] arXiv:2403.11265 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Forging the Forger: An Attempt to Improve Authorship Verification via Data AugmentationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else. It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author. In this paper, we investigate the potential benefits of augmenting the classifier training set with (negative) synthetic examples. These synthetic examples are generated to imitate the style of the author of interest. We analyze the improvements in classifier prediction that this augmentation brings to bear in the task of AV in an adversarial setting. In particular, we experiment with three different generator architectures (one based on Recurrent Neural Networks, another based on small-scale transformers, and another based on the popular GPT model) and with two training strategies (one inspired by standard Language Models, and another inspired by Wasserstein Generative Adversarial Networks). We evaluate our hypothesis on five datasets (three of which have been specifically collected to represent an adversarial setting) and using two learning algorithms for the AV classifier (Support Vector Machines and Convolutional Neural Networks). This experimentation has yielded negative results, revealing that, although our methodology proves effective in many adversarial settings, its benefits are too sporadic for a pragmatical application.
- [1613] arXiv:2403.11292 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multi-Relational Graph Neural Network for Out-of-Domain Link PredictionComments: 8 pages, 3 figures, 3 Tables, conference [accepted in IEEE WCCI 2024]Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Dynamic multi-relational graphs are an expressive relational representation for data enclosing entities and relations of different types, and where relationships are allowed to vary in time. Addressing predictive tasks over such data requires the ability to find structure embeddings that capture the diversity of the relationships involved, as well as their dynamic evolution. In this work, we establish a novel class of challenging tasks for dynamic multi-relational graphs involving out-of-domain link prediction, where the relationship being predicted is not available in the input graph. We then introduce a novel Graph Neural Network model, named GOOD, designed specifically to tackle the out-of-domain generalization problem. GOOD introduces a novel design concept for multi-relation embedding aggregation, based on the idea that good representations are such when it is possible to disentangle the mixing proportions of the different relational embeddings that have produced it. We also propose five benchmarks based on two retail domains, where we show that GOOD can effectively generalize predictions out of known relationship types and achieve state-of-the-art results. Most importantly, we provide insights into problems where out-of-domain prediction might be preferred to an in-domain formulation, that is, where the relationship to be predicted has very few positive examples.
- [1614] arXiv:2403.11299 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SQ-LLaVA: Self-Questioning for Large Vision-Language AssistantSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Recent advancements in the vision-language model have shown notable generalization in vision-language tasks after visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language models becomes the whole network's bottleneck. To improve cross-modality alignment, existing works usually consider more visual instruction data covering a broader range of vision tasks to fine-tune the model for question-answering, which are costly to obtain. However, the image contains rich contextual information that has been largely under-explored. This paper first attempts to harness this overlooked context within visual instruction data, training the model to self-supervised `learning' how to ask high-quality questions. In this way, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data shows a consistent performance improvement compared with traditional visual-instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts.
- [1615] arXiv:2403.11304 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Pioneering SE(2)-Equivariant Trajectory Planning for Automated DrivingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Planning the trajectory of the controlled ego vehicle is a key challenge in automated driving. As for human drivers, predicting the motions of surrounding vehicles is important to plan the own actions. Recent motion prediction methods utilize equivariant neural networks to exploit geometric symmetries in the scene. However, no existing method combines motion prediction and trajectory planning in a joint step while guaranteeing equivariance under roto-translations of the input space. We address this gap by proposing a lightweight equivariant planning model that generates multi-modal joint predictions for all vehicles and selects one mode as the ego plan. The equivariant network design improves sample efficiency, guarantees output stability, and reduces model parameters. We further propose equivariant route attraction to guide the ego vehicle along a high-level route provided by an off-the-shelf GPS navigation system. This module creates a momentum from embedded vehicle positions toward the route in latent space while keeping the equivariance property. Route attraction enables goal-oriented behavior without forcing the vehicle to stick to the exact route. We conduct experiments on the challenging nuScenes dataset to investigate the capability of our planner. The results show that the planned trajectory is stable under roto-translations of the input scene which demonstrates the equivariance of our model. Despite using only a small split of the dataset for training, our method improves L2 distance at 3 s by 20.6 % and surpasses the state of the art.
- [1616] arXiv:2403.11322 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: StateFlow: Enhancing LLM Task-Solving through State-Driven WorkflowsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: It is a notable trend to use Large Language Models (LLMs) to tackle complex tasks, e.g., tasks that require a sequence of actions and dynamic interaction with tools and external environments. In this paper, we propose StateFlow, a novel LLM-based task-solving paradigm that conceptualizes complex task-solving processes as state machines. In StateFlow, we distinguish between "process grounding" (via state and state transitions) and "sub-task solving" (through actions within a state), enhancing control and interpretability of the task-solving procedure. A state represents the status of a running process. The transitions between states are controlled by heuristic rules or decisions made by the LLM, allowing for a dynamic and adaptive progression. Upon entering a state, a series of actions is executed, involving not only calling LLMs guided by different prompts, but also the utilization of external tools as needed. Our results show that StateFlow significantly enhances LLMs' efficiency. For instance, StateFlow achieves 13% and 28% higher success rates compared to ReAct in InterCode SQL and ALFWorld benchmark, with 5x and 3x less cost respectively. We also show that StateFlow can be combined with iterative refining methods like Reflexion to further improve performance.
- [1617] arXiv:2403.11328 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Domain-Guided Masked Autoencoders for Unique Player IdentificationBavesh Balaji , Jerrin Bright , Sirisha Rambhatla , Yuhao Chen , Alexander Wong , John Zelek , David A ClausiComments: Submitted to 21st International Conference on Robots and Vision (CRV'24), Guelph, Ontario, CanadaSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Unique player identification is a fundamental module in vision-driven sports analytics. Identifying players from broadcast videos can aid with various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatic detection of jersey numbers using deep features is challenging primarily due to: a) motion blur, b) low resolution video feed, and c) occlusions. With their recent success in various vision tasks, masked autoencoders (MAEs) have emerged as a superior alternative to conventional feature extractors. However, most MAEs simply zero-out image patches either randomly or focus on where to mask rather than how to mask. Motivated by human vision, we devise a novel domain-guided masking policy for MAEs termed d-MAE to facilitate robust feature extraction in the presence of motion blur for player identification. We further introduce a new spatio-temporal network leveraging our novel d-MAE for unique player identification. We conduct experiments on three large-scale sports datasets, including a curated baseball dataset, the SoccerNet dataset, and an in-house ice hockey dataset. We preprocess the datasets using an upgraded keyframe identification (KfID) module by focusing on frames containing jersey numbers. Additionally, we propose a keyframe-fusion technique to augment keyframes, preserving spatial and temporal context. Our spatio-temporal network showcases significant improvements, surpassing the current state-of-the-art by 8.58%, 4.29%, and 1.20% in the test set accuracies, respectively. Rigorous ablations highlight the effectiveness of our domain-guided masking approach and the refined KfID module, resulting in performance enhancements of 1.48% and 1.84% respectively, compared to original architectures.
- [1618] arXiv:2403.11330 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal FeedbackComments: 10 pages, 3 figures, 2 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally-occurring multimodal signals. At a high level, our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the standard RHLF pipeline improve an LLM-based dialog agent. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
- [1619] arXiv:2403.11337 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We propose a deep learning based novel prediction framework for enhanced bandwidth reduction in motion transfer enabled video applications such as video conferencing, virtual reality gaming and privacy preservation for patient health monitoring. To model complex motion, we use the First Order Motion Model (FOMM) that represents dynamic objects using learned keypoints along with their local affine transformations. Keypoints are extracted by a self-supervised keypoint detector and organized in a time series corresponding to the video frames. Prediction of keypoints, to enable transmission using lower frames per second on the source device, is performed using a Variational Recurrent Neural Network (VRNN). The predicted keypoints are then synthesized to video frames using an optical flow estimator and a generator network. This efficacy of leveraging keypoint based representations in conjunction with VRNN based prediction for both video animation and reconstruction is demonstrated on three diverse datasets. For real-time applications, our results show the effectiveness of our proposed architecture by enabling up to 2x additional bandwidth reduction over existing keypoint based video motion transfer frameworks without significantly compromising video quality.
- [1620] arXiv:2403.11345 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Independent RL for Cooperative-Competitive Agents: A Mean-Field PerspectiveSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Abstract: We address in this paper Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in the finite population setting, we consider the case where the number of agents within each team is infinite, i.e., the mean-field setting. This results in a General-Sum LQ Mean-Field Type Game (GS-MFTGs). We characterize the Nash equilibrium (NE) of the GS-MFTG, under a standard invertibility condition. This MFTG NE is then shown to be $\mathcal{O}(1/M)$-NE for the finite population game where $M$ is a lower bound on the number of agents in each team. These structural results motivate an algorithm called Multi-player Receding-horizon Natural Policy Gradient (MRPG), where each team minimizes its cumulative cost independently in a receding-horizon manner. Despite the non-convexity of the problem, we establish that the resulting algorithm converges to a global NE through a novel problem decomposition into sub-problems using backward recursive discrete-time Hamilton-Jacobi-Isaacs (HJI) equations, in which independent natural policy gradient is shown to exhibit linear convergence under time-independent diagonal dominance. Experiments illuminate the merits of this approach in practice.
- [1621] arXiv:2403.11346 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models Using Synthetic Back-Translation DataSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Neural Machine Translation (NMT) for low-resource languages is still a challenging task in front of NLP researchers. In this work, we deploy a standard data augmentation methodology by back-translation to a new language translation direction Cantonese-to-English. We present the models we fine-tuned using the limited amount of real data and the synthetic data we generated using back-translation including OpusMT, NLLB, and mBART. We carried out automatic evaluation using a range of different metrics including lexical-based and embedding-based. Furthermore. we create a user-friendly interface for the models we included in this\textsc{ CantonMT} research project and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models into this platform via our open-source\textsc{ CantonMT} toolkit \url{ this https URL }.
- [1622] arXiv:2403.11348 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic CircuitsComments: Accepted to ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during the inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2.
- [1623] arXiv:2403.11353 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Solvent-Aware 2D NMR Prediction: Leveraging Multi-Tasking Training and Iterative Self-Training StrategiesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Abstract: Nuclear magnetic resonance (NMR) spectroscopy plays a pivotal role in various scientific fields, offering insights into structural information, electronic properties and dynamic behaviors of molecules. Accurate NMR spectrum prediction efficiently produces candidate molecules, enabling chemists to compare them with actual experimental spectra. This process aids in confirming molecular structures or pinpointing discrepancies, guiding further investigation. Machine Learning (ML) has then emerged as a promising alternative approach for predicting atomic NMR chemical shits of molecules given their structures. Although significant progresses have been made in predicting one-dimensional (1D) NMR, two-dimensional (2D) NMR prediction via ML remains a challenge due to the lack of annotated NMR training datasets. To address this gap, we propose an iterative self-training (IST) approach to train a deep learning model for predicting atomic 2DNMR shifts and assigning peaks in experimental spectra. Our model undergoes an initial pre-training phase employing a Multi-Task Training (MTT) approach, which simultaneously leverages annotated 1D NMR datasets of both $^{1}\text{H}$ and $^{13}\text{C}$ spectra to enhance its understanding of NMR spectra. Subsequently, the pre-trained model is utilized to generate pseudo-annotations for unlabelled 2D NMR spectra, which are subsequently used to refine the 2D NMR prediction model. Our approach iterates between annotated unlabelled 2D NMR data and refining our 2D NMR prediction model until convergence. Finally, our model is able to not only accurately predict 2D NMR but also annotate peaks in experimental 2D NMR spectra. Experimental results show that our model is capable of accurately handling medium-sized and large molecules, including polysaccharides, underscoring its effectiveness.
- [1624] arXiv:2403.11363 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: IGANN Sparse: Bridging Sparsity and Interpretability with Non-linear InsightComments: Preprint conditionally accepted for archival and presentation at the 32nd European Conference on Information Systems (ECIS 2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Feature selection is a critical component in predictive analytics that significantly affects the prediction accuracy and interpretability of models. Intrinsic methods for feature selection are built directly into model learning, providing a fast and attractive option for large amounts of data. Machine learning algorithms, such as penalized regression models (e.g., lasso) are the most common choice when it comes to in-built feature selection. However, they fail to capture non-linear relationships, which ultimately affects their ability to predict outcomes in intricate datasets. In this paper, we propose IGANN Sparse, a novel machine learning model from the family of generalized additive models, which promotes sparsity through a non-linear feature selection process during training. This ensures interpretability through improved model sparsity without sacrificing predictive performance. Moreover, IGANN Sparse serves as an exploratory tool for information systems researchers to unveil important non-linear relationships in domains that are characterized by complex patterns. Our ongoing research is directed at a thorough evaluation of the IGANN Sparse model, including user studies that allow to assess how well users of the model can benefit from the reduced number of features. This will allow for a deeper understanding of the interactions between linear vs. non-linear modeling, number of selected features, and predictive performance.
- [1625] arXiv:2403.11368 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Driving Style Alignment for LLM-powered Driver AgentSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Recently, LLM-powered driver agents have demonstrated considerable potential in the field of autonomous driving, showcasing human-like reasoning and decision-making abilities.However, current research on aligning driver agent behaviors with human driving styles remains limited, partly due to the scarcity of high-quality natural language data from human driving this http URL address this research gap, we propose a multi-alignment framework designed to align driver agents with human driving styles through demonstrations and feedback. Notably, we construct a natural language dataset of human driver behaviors through naturalistic driving experiments and post-driving interviews, offering high-quality human demonstrations for LLM alignment. The framework's effectiveness is validated through simulation experiments in the CARLA urban traffic simulator and further corroborated by human evaluations. Our research offers valuable insights into designing driving agents with diverse driving styles.The implementation of the framework and details of the dataset can be found at the link.
- [1626] arXiv:2403.11395 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Automated data processing and feature engineering for deep learning and big data applications: a surveyComments: Journal of Information and Intelligence (2024)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before they can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for Big Data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing--e.g., data cleaning, labeling, missing data imputation, and categorical data encoding--as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering--specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.
- [1627] arXiv:2403.11401 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Scene-LLM: Extending Language Model for 3D Visual Understanding and ReasoningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
- [1628] arXiv:2403.11402 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Embracing the Generative AI Revolution: Advancing Tertiary Education in Cybersecurity with GPTSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The rapid advancement of generative Artificial Intelligence (AI) technologies, particularly Generative Pre-trained Transformer (GPT) models such as ChatGPT, has the potential to significantly impact cybersecurity. In this study, we investigated the impact of GPTs, specifically ChatGPT, on tertiary education in cybersecurity, and provided recommendations for universities to adapt their curricula to meet the evolving needs of the industry. Our research highlighted the importance of understanding the alignment between GPT's ``mental model'' and human cognition, as well as the enhancement of GPT capabilities to human skills based on Bloom's taxonomy. By analyzing current educational practices and the alignment of curricula with industry requirements, we concluded that universities providing practical degrees like cybersecurity should align closely with industry demand and embrace the inevitable generative AI revolution, while applying stringent ethics oversight to safeguard responsible GPT usage. We proposed a set of recommendations focused on updating university curricula, promoting agility within universities, fostering collaboration between academia, industry, and policymakers, and evaluating and assessing educational outcomes.
- [1629] arXiv:2403.11415 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image ManipulationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Reverse sampling and score-distillation have emerged as main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode-collapsing. To address these limitations and leverage the strengths of both approaches, here we introduce a novel framework called {\em DreamSampler}, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score-distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction and etc, we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications.
- [1630] arXiv:2403.11418 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Variational Sampling of Temporal TrajectoriesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A deterministic temporal process can be determined by its trajectory, an element in the product space of (a) initial condition $z_0 \in \mathcal{Z}$ and (b) transition function $f: (\mathcal{Z}, \mathcal{T}) \to \mathcal{Z}$ often influenced by the control of the underlying dynamical system. Existing methods often model the transition function as a differential equation or as a recurrent neural network. Despite their effectiveness in predicting future measurements, few results have successfully established a method for sampling and statistical inference of trajectories using neural networks, partially due to constraints in the parameterization. In this work, we introduce a mechanism to learn the distribution of trajectories by parameterizing the transition function $f$ explicitly as an element in a function space. Our framework allows efficient synthesis of novel trajectories, while also directly providing a convenient tool for inference, i.e., uncertainty estimation, likelihood evaluations and out of distribution detection for abnormal trajectories. These capabilities can have implications for various downstream tasks, e.g., simulation and evaluation for reinforcement learning.
- [1631] arXiv:2403.11420 (cross-list from hep-th) [ pdf , ps , html , other ]
-
Title: Neural network representation of quantum systemsComments: 24 pages, 6 figuresSubjects: High Energy Physics - Theory (hep-th) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Abstract: It has been proposed that random wide neural networks near Gaussian process are quantum field theories around Gaussian fixed points. In this paper, we provide a novel map with which a wide class of quantum mechanical systems can be cast into the form of a neural network with a statistical summation over network parameters. Our simple idea is to use the universal approximation theorem of neural networks to generate arbitrary paths in the Feynman's path integral. The map can be applied to interacting quantum systems / field theories, even away from the Gaussian limit. Our findings bring machine learning closer to the quantum world.
- [1632] arXiv:2403.11432 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Demystifying Deep Reinforcement Learning-Based Autonomous Vehicle Decision-MakingComments: Submitted for peer-reviewSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: With the advent of universal function approximators in the domain of reinforcement learning, the number of practical applications leveraging deep reinforcement learning (DRL) has exploded. Decision-making in automated driving tasks has emerged as a chief application among them, taking the sensor data or the higher-order kinematic variables as the input and providing a discrete choice or continuous control output. However, the black-box nature of the models presents an overwhelming limitation that restricts the real-world deployment of DRL in autonomous vehicles (AVs). Therefore, in this research work, we focus on the interpretability of an attention-based DRL framework. We use a continuous proximal policy optimization-based DRL algorithm as the baseline model and add a multi-head attention framework in an open-source AV simulation environment. We provide some analytical techniques for discussing the interpretability of the trained models in terms of explainability and causality for spatial and temporal correlations. We show that the weights in the first head encode the positions of the neighboring vehicles while the second head focuses on the leader vehicle exclusively. Also, the ego vehicle's action is causally dependent on the vehicles in the target lane spatially and temporally. Through these findings, we reliably show that these techniques can help practitioners decipher the results of the DRL algorithms.
- [1633] arXiv:2403.11456 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language ModelsComments: PreprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: The ubiquitousness of social media has led to the need for reliable and efficient detection of offensive content to limit harmful effects. This has led to a proliferation of datasets and models related to detecting offensive content. While sophisticated models have attained strong performance on individual datasets, these models often do not generalize due to differences between how "offensive content" is conceptualized, and the resulting differences in how these datasets are labeled. In this paper, we introduce HateCOT, a dataset of 52,000 samples drawn from diverse existing sources with explanations generated by GPT-3.5-Turbo and human-curated. We show that pre-training models for the detection of offensive content on HateCOT significantly boots open-sourced Language Models on three benchmark datasets in both zero and few-shot settings, despite differences in domain and task.} We further find that HateCOT enables effective K-shot fine-tuning in the low-resource settings.
- [1634] arXiv:2403.11468 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Collage Prompting: Budget-Friendly Visual Recognition with GPT-4VSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in generative AI have suggested that by taking visual prompt, GPT-4V can demonstrate significant proficiency in image recognition task. Despite its impressive capabilities, the financial cost associated with GPT-4V's inference presents a substantial barrier for its wide use. To address this challenge, our work introduces Collage Prompting, a budget-friendly prompting approach that concatenates multiple images into a single visual input. With collage prompt, GPT-4V is able to perform image recognition on several images simultaneously. Based on the observation that the accuracy of GPT-4V's image recognition varies significantly with the order of images within the collage prompt, our method further learns to optimize the arrangement of images for maximum recognition accuracy. A graph predictor is trained to indicate the accuracy of each collage prompt, then we propose an optimization method to navigate the search space of possible image arrangements. Experiment results across various datasets demonstrate the cost-efficiency score of collage prompt is much larger than standard prompt. Additionally, collage prompt with learned arrangement achieves clearly better accuracy than collage prompt with random arrangement in GPT-4V's visual recognition.
- [1635] arXiv:2403.11473 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Word Order's Impacts: Insights from Reordering and Generation AnalysisSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Existing works have studied the impacts of the order of words within natural text. They usually analyze it by destroying the original order of words to create a scrambled sequence, and then comparing the models' performance between the original and scrambled sequences. The experimental results demonstrate marginal drops. Considering this findings, different hypothesis about word order is proposed, including ``the order of words is redundant with lexical semantics'', and ``models do not rely on word order''. In this paper, we revisit the aforementioned hypotheses by adding a order reconstruction perspective, and selecting datasets of different spectrum. Specifically, we first select four different datasets, and then design order reconstruction and continuing generation tasks. Empirical findings support that ChatGPT relies on word order to infer, but cannot support or negate the redundancy relations between word order lexical semantics.
- [1636] arXiv:2403.11483 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Open-World Semi-Supervised Learning for Node ClassificationYanling Wang , Jing Zhang , Lingxi Zhang , Lixin Liu , Yuxiao Dong , Cuiping Li , Hong Chen , Hongzhi YinComments: Accepted by ICDE 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: Open-world semi-supervised learning (Open-world SSL) for node classification, that classifies unlabeled nodes into seen classes or multiple novel classes, is a practical but under-explored problem in the graph community. As only seen classes have human labels, they are usually better learned than novel classes, and thus exhibit smaller intra-class variances within the embedding space (named as imbalance of intra-class variances between seen and novel classes). Based on empirical and theoretical analysis, we find the variance imbalance can negatively impact the model performance. Pre-trained feature encoders can alleviate this issue via producing compact representations for novel classes. However, creating general pre-trained encoders for various types of graph data has been proven to be challenging. As such, there is a demand for an effective method that does not rely on pre-trained graph encoders. In this paper, we propose an IMbalance-Aware method named OpenIMA for Open-world semi-supervised node classification, which trains the node classification model from scratch via contrastive learning with bias-reduced pseudo labels. Extensive experiments on seven popular graph benchmarks demonstrate the effectiveness of OpenIMA, and the source code has been available on GitHub.
- [1637] arXiv:2403.11487 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction SynthesisComments: 14 PagesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. We finally discuss the applicability of our approach in enabling a generalizable evaluation of embodied navigation policies. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner, without training.
- [1638] arXiv:2403.11492 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion PredictionComments: Camera-ready version for CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. Context information, such as road maps and surrounding agents' states, provides crucial geometric and semantic information for motion behavior prediction. To this end, recent works explore two-stage prediction frameworks where coarse trajectories are first proposed, and then used to select critical context information for trajectory refinement. However, they either incur a large amount of computation or bring limited improvement, if not both. In this paper, we introduce a novel scenario-adaptive refinement strategy, named SmartRefine, to refine prediction with minimal additional computation. Specifically, SmartRefine can comprehensively adapt refinement configurations based on each scenario's properties, and smartly chooses the number of refinement iterations by introducing a quality score to measure the prediction quality and remaining refinement potential of each scenario. SmartRefine is designed as a generic and flexible approach that can be seamlessly integrated into most state-of-the-art motion prediction models. Experiments on Argoverse (1 & 2) show that our method consistently improves the prediction accuracy of multiple state-of-the-art prediction models. Specifically, by adding SmartRefine to QCNet, we outperform all published ensemble-free works on the Argoverse 2 leaderboard (single agent track) at submission. Comprehensive studies are also conducted to ablate design choices and explore the mechanism behind multi-iteration refinement. Codes are available at this https URL
- [1639] arXiv:2403.11495 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Semantic-Enhanced Representation Learning for Road Networks with Temporal DynamicsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In this study, we introduce a novel framework called Toast for learning general-purpose representations of road networks, along with its advanced counterpart DyToast, designed to enhance the integration of temporal dynamics to boost the performance of various time-sensitive downstream tasks. Specifically, we propose to encode two pivotal semantic characteristics intrinsic to road networks: traffic patterns and traveling semantics. To achieve this, we refine the skip-gram module by incorporating auxiliary objectives aimed at predicting the traffic context associated with a target road segment. Moreover, we leverage trajectory data and design pre-training strategies based on Transformer to distill traveling semantics on road networks. DyToast further augments this framework by employing unified trigonometric functions characterized by their beneficial properties, enabling the capture of temporal evolution and dynamic nature of road networks more effectively. With these proposed techniques, we can obtain representations that encode multi-faceted aspects of knowledge within road networks, applicable across both road segment-based applications and trajectory-based applications. Extensive experiments on two real-world datasets across three tasks demonstrate that our proposed framework consistently outperforms the state-of-the-art baselines by a significant margin.
- [1640] arXiv:2403.11496 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: MCD: Diverse Large-Scale Multi-Campus Dataset for Robot PerceptionThien-Minh Nguyen , Shenghai Yuan , Thien Hoang Nguyen , Pengyu Yin , Haozhi Cao , Lihua Xie , Maciej Wozniak , Patric Jensfelt , Marko Thiel , Justin Ziegenbein , Noel BlunderComments: Accepted by The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), featuring a wide range of sensing modalities, high-accuracy ground truth, and diverse challenging environments across three Eurasian university campuses. MCD comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce semantic annotations of 29 classes over 59k sparse NRE lidar scans across three domains, thus providing a novel challenge to existing semantic segmentation research upon this largely unexplored lidar modality. Finally, we propose, for the first time to the best of our knowledge, continuous-time ground truth based on optimization-based registration of lidar-inertial data on large survey-grade prior maps, which are also publicly released, each several times the size of existing ones. We conduct a rigorous evaluation of numerous state-of-the-art algorithms on MCD, report their performance, and highlight the challenges awaiting solutions from the research community.
- [1641] arXiv:2403.11504 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: MLVICX: Multi-Level Variance-Covariance Exploration for Chest X-ray Self-Supervised Representation LearningSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Self-supervised learning (SSL) is potentially useful in reducing the need for manual annotation and making deep learning models accessible for medical image analysis tasks. By leveraging the representations learned from unlabeled data, self-supervised models perform well on tasks that require little to no fine-tuning. However, for medical images, like chest X-rays, which are characterized by complex anatomical structures and diverse clinical conditions, there arises a need for representation learning techniques that can encode fine-grained details while preserving the broader contextual information. In this context, we introduce MLVICX (Multi-Level Variance-Covariance Exploration for Chest X-ray Self-Supervised Representation Learning), an approach to capture rich representations in the form of embeddings from chest X-ray images. Central to our approach is a novel multi-level variance and covariance exploration strategy that empowers the model to detect diagnostically meaningful patterns while reducing redundancy effectively. By enhancing the variance and covariance of the learned embeddings, MLVICX promotes the retention of critical medical insights by adapting both global and local contextual details. We demonstrate the performance of MLVICX in advancing self-supervised chest X-ray representation learning through comprehensive experiments. The performance enhancements we observe across various downstream tasks highlight the significance of the proposed approach in enhancing the utility of chest X-ray embeddings for precision medical diagnosis and comprehensive image analysis. For pertaining, we used the NIH-Chest X-ray dataset, while for downstream tasks, we utilized NIH-Chest X-ray, Vinbig-CXR, RSNA pneumonia, and SIIM-ACR Pneumothorax datasets. Overall, we observe more than 3% performance gains over SOTA SSL approaches in various downstream tasks.
- [1642] arXiv:2403.11506 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: End-To-End Underwater Video Enhancement: Dataset and ModelSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Underwater video enhancement (UVE) aims to improve the visibility and frame quality of underwater videos, which has significant implications for marine research and exploration. However, existing methods primarily focus on developing image enhancement algorithms to enhance each frame independently. There is a lack of supervised datasets and models specifically tailored for UVE tasks. To fill this gap, we construct the Synthetic Underwater Video Enhancement (SUVE) dataset, comprising 840 diverse underwater-style videos paired with ground-truth reference videos. Based on this dataset, we train a novel underwater video enhancement model, UVENet, which utilizes inter-frame relationships to achieve better enhancement performance. Through extensive experiments on both synthetic and real underwater videos, we demonstrate the effectiveness of our approach. This study represents the first comprehensive exploration of UVE to our knowledge. The code is available at https://anonymous.4open.science/r/UVENet.
- [1643] arXiv:2403.11536 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OCR is All you need: Importing Multi-Modality into Image-based Defect Detection SystemSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automatic optical inspection (AOI) plays a pivotal role in the manufacturing process, predominantly leveraging high-resolution imaging instruments for scanning purposes. It detects anomalies by analyzing image textures or patterns, making it an essential tool in industrial manufacturing and quality control. Despite its importance, the deployment of models for AOI often faces challenges. These include limited sample sizes, which hinder effective feature learning, variations among source domains, and sensitivities to changes in lighting and camera positions during imaging. These factors collectively compromise the accuracy of model predictions. Traditional AOI often fails to capitalize on the rich mechanism-parameter information from machines or inside images, including statistical parameters, which typically benefit AOI classification. To address this, we introduce an external modality-guided data mining framework, primarily rooted in optical character recognition (OCR), to extract statistical features from images as a second modality to enhance performance, termed OANet (Ocr-Aoi-Net). A key aspect of our approach is the alignment of external modality features, extracted using a single modality-aware model, with image features encoded by a convolutional neural network. This synergy enables a more refined fusion of semantic representations from different modalities. We further introduce feature refinement and a gating function in our OANet to optimize the combination of these features, enhancing inference and decision-making capabilities. Experimental outcomes show that our methodology considerably boosts the recall rate of the defect detection model and maintains high robustness even in challenging scenarios.
- [1644] arXiv:2403.11552 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: LLM3:Large Language Model-based Task and Motion Planning with Motion Failure ReasoningComments: Submitted to IROS 2024. Codes available: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Conventional Task and Motion Planning (TAMP) approaches rely on manually crafted interfaces connecting symbolic task planning with continuous motion generation. These domain-specific and labor-intensive modules are limited in addressing emerging tasks in real-world settings. Here, we present LLM^3, a novel Large Language Model (LLM)-based TAMP framework featuring a domain-independent interface. Specifically, we leverage the powerful reasoning and planning capabilities of pre-trained LLMs to propose symbolic action sequences and select continuous action parameters for motion planning. Crucially, LLM^3 incorporates motion planning feedback through prompting, allowing the LLM to iteratively refine its proposals by reasoning about motion failure. Consequently, LLM^3 interfaces between task planning and motion planning, alleviating the intricate design process of handling domain-specific messages between them. Through a series of simulations in a box-packing domain, we quantitatively demonstrate the effectiveness of LLM^3 in solving TAMP problems and the efficiency in selecting action parameters. Ablation studies underscore the significant contribution of motion failure reasoning to the success of LLM^3. Furthermore, we conduct qualitative experiments on a physical manipulator, demonstrating the practical applicability of our approach in real-world settings.
- [1645] arXiv:2403.11558 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning with Token-level Feedback for Controllable Text GenerationComments: Accepted to NAACL 2024 FindingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: To meet the requirements of real-world applications, it is essential to control generations of large language models (LLMs). Prior research has tried to introduce reinforcement learning (RL) into controllable text generation while most existing methods suffer from overfitting issues (finetuning-based methods) or semantic collapse (post-processing methods). However, current RL methods are generally guided by coarse-grained (sentence/paragraph-level) feedback, which may lead to suboptimal performance owing to semantic twists or progressions within sentences. To tackle that, we propose a novel reinforcement learning algorithm named TOLE which formulates TOken-LEvel rewards for controllable text generation, and employs a "first-quantize-then-noise" paradigm to enhance the robustness of the RL algorithm.Furthermore, TOLE can be flexibly extended to multiple constraints with little computational expense. Experimental results show that our algorithm can achieve superior performance on both single-attribute and multi-attribute control tasks. We have released our codes at this https URL
- [1646] arXiv:2403.11585 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Linguacodus: A Synergistic Framework for Transformative Code Generation in Machine Learning PipelinesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
Abstract: In the ever-evolving landscape of machine learning, seamless translation of natural language descriptions into executable code remains a formidable challenge. This paper introduces Linguacodus, an innovative framework designed to tackle this challenge by deploying a dynamic pipeline that iteratively transforms natural language task descriptions into code through high-level data-shaping instructions. The core of Linguacodus is a fine-tuned large language model (LLM), empowered to evaluate diverse solutions for various problems and select the most fitting one for a given task. This paper details the fine-tuning process, and sheds light on how natural language descriptions can be translated into functional code. Linguacodus represents a substantial leap towards automated code generation, effectively bridging the gap between task descriptions and executable code. It holds great promise for advancing machine learning applications across diverse domains. Additionally, we propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction. In extensive experiments on a vast machine learning code dataset originating from Kaggle, we showcase the effectiveness of Linguacodus. The investigations highlight its potential applications across diverse domains, emphasizing its impact on applied machine learning in various scientific fields.
- [1647] arXiv:2403.11598 (cross-list from quant-ph) [ pdf , ps , other ]
-
Title: Optimal Layout Synthesis for Deep Quantum Circuits on NISQ Processors with 100+ QubitsComments: 7 Figures, 4 Tables, 1 ListingSubjects: Quantum Physics (quant-ph) ; Artificial Intelligence (cs.AI)
Abstract: Layout synthesis is mapping a quantum circuit to a quantum processor. SWAP gate insertions are needed for scheduling 2-qubit gates only on connected physical qubits. With the ever-increasing number of qubits in NISQ processors, scalable layout synthesis is of utmost importance. With large optimality gaps observed in heuristic approaches, scalable exact methods are needed. While recent exact and near-optimal approaches scale to moderate circuits, large deep circuits are still out of scope.
In this work, we propose a SAT encoding based on parallel plans that apply 1 SWAP and a group of CNOTs at each time step. Using domain-specific information, we maintain optimality in parallel plans while scaling to large and deep circuits. From our results, we show the scalability of our approach which significantly outperforms leading exact and near-optimal approaches (up to 100x). For the first time, we can optimally map several 8, 14, and 16 qubit circuits onto 54, 80, and 127 qubit platforms with up to 17 SWAPs. While adding optimal SWAPs, we also report near-optimal depth in our mapped circuits. - [1648] arXiv:2403.11626 (cross-list from cs.GR) [ pdf , ps , html , other ]
-
Title: QEAN: Quaternion-Enhanced Attention Network for Visual Dance GenerationComments: Accepted by The Visual Computer JournalSubjects: Graphics (cs.GR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract: The study of music-generated dance is a novel and challenging Image generation task. It aims to input a piece of music and seed motions, then generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks related to human movements and music due to their struggle in capturing the nonlinear relationship and temporal aspects. This can lead to issues like joint deformation, role deviation, floating, and inconsistencies in dance movements generated in response to the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective, which consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, leading to better learning of features of movement sequences and audio sequences, and improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features in the form of a series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation. Finally, we conducted experiments on the dataset AIST++, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements. Our source code and dataset can be available from this https URL and this https URL respectively.
- [1649] arXiv:2403.11671 (cross-list from cs.AR) [ pdf , ps , html , other ]
-
Title: HDLdebugger: Streamlining HDL debugging with Large Language ModelsComments: 13 pages,5 figuresSubjects: Hardware Architecture (cs.AR) ; Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Software Engineering (cs.SE)
Abstract: In the domain of chip design, Hardware Description Languages (HDLs) play a pivotal role. However, due to the complex syntax of HDLs and the limited availability of online resources, debugging HDL codes remains a difficult and time-intensive task, even for seasoned engineers. Consequently, there is a pressing need to develop automated HDL code debugging models, which can alleviate the burden on hardware engineers. Despite the strong capabilities of Large Language Models (LLMs) in generating, completing, and debugging software code, their utilization in the specialized field of HDL debugging has been limited and, to date, has not yielded satisfactory results. In this paper, we propose an LLM-assisted HDL debugging framework, namely HDLdebugger, which consists of HDL debugging data generation via a reverse engineering approach, a search engine for retrieval-augmented generation, and a retrieval-augmented LLM fine-tuning approach. Through the integration of these components, HDLdebugger can automate and streamline HDL debugging for chip design. Our comprehensive experiments, conducted on an HDL code dataset sourced from Huawei, reveal that HDLdebugger outperforms 13 cutting-edge LLM baselines, displaying exceptional effectiveness in HDL code debugging.
- [1650] arXiv:2403.11703 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution ImagesRuyi Xu , Yuan Yao , Zonghao Guo , Junbo Cui , Zanlin Ni , Chunjiang Ge , Tat-Seng Chua , Zhiyuan Liu , Maosong Sun , Gao HuangComments: PreprintSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at this https URL .
- [1651] arXiv:2403.11755 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Meta-Prompting for Automating Zero-shot Visual Recognition with LLMsM. Jehanzeb Mirza , Leonid Karlinsky , Wei Lin , Sivan Doveh , Jakub Micorek , Mateusz Kozinski , Hilde Kuhene , Horst PosseggerComments: Project Page (Code and Data): this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively
- [1652] arXiv:2403.11772 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attentionComments: Submitted to 9th Graz BCI Conference 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Motivated by the challenge of seamless cross-dataset transfer in EEG signal processing, this article presents an exploratory study on the use of Joint Embedding Predictive Architectures (JEPAs). In recent years, self-supervised learning has emerged as a promising approach for transfer learning in various domains. However, its application to EEG signals remains largely unexplored. In this article, we introduce Signal-JEPA for representing EEG recordings which includes a novel domain-specific spatial block masking strategy and three novel architectures for downstream classification. The study is conducted on a 54~subjects dataset and the downstream performance of the models is evaluated on three different BCI paradigms: motor imagery, ERP and SSVEP. Our study provides preliminary evidence for the potential of JEPAs in EEG signal encoding. Notably, our results highlight the importance of spatial filtering for accurate downstream classification and reveal an influence of the length of the pre-training examples but not of the mask size on the downstream performance.
- [1653] arXiv:2403.11780 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language PromptYongqi Wang , Ruofan Hu , Rongjie Huang , Zhiqing Hong , Ruiqi Li , Wenrui Liu , Fuming You , Tao Jin , Zhou ZhaoComments: Accepted by NAACL 2024 (main conference)Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract: Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at this http URL .
- [1654] arXiv:2403.11786 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Construction of Hyper-Relational Knowledge Graphs Using Pre-Trained Large Language ModelsComments: 5 pages + referencesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Extracting hyper-relations is crucial for constructing comprehensive knowledge graphs, but there are limited supervised methods available for this task. To address this gap, we introduce a zero-shot prompt-based method using OpenAI's GPT-3.5 model for extracting hyper-relational knowledge from text. Comparing our model with a baseline, we achieved promising results, with a recall of 0.77. Although our precision is currently lower, a detailed analysis of the model outputs has uncovered potential pathways for future research in this area.
- [1655] arXiv:2403.11790 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Deep Medial Voxels: Learned Medial Axis Approximations for Anatomical Shape ModelingComments: 10 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Shape reconstruction from imaging volumes is a recurring need in medical image analysis. Common workflows start with a segmentation step, followed by careful post-processing and,finally, ad hoc meshing algorithms. As this sequence can be timeconsuming, neural networks are trained to reconstruct shapes through template deformation. These networks deliver state-ofthe-art results without manual intervention, but, so far, they have primarily been evaluated on anatomical shapes with little topological variety between individuals. In contrast, other works favor learning implicit shape models, which have multiple benefits for meshing and visualization. Our work follows this direction by introducing deep medial voxels, a semi-implicit representation that faithfully approximates the topological skeleton from imaging volumes and eventually leads to shape reconstruction via convolution surfaces. Our reconstruction technique shows potential for both visualization and computer simulations.
- [1656] arXiv:2403.11793 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning CorpusSeungpil Lee , Woochang Sim , Donghyeon Shin , Sanha Hwang , Wongyu Seo , Jiwon Park , Seokki Lee , Sejin Kim , Sundong KimComments: 25 pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Symbolic Computation (cs.SC)
Abstract: The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been results-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstract and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs, proposing development paths for achieving human-level reasoning.
- [1657] arXiv:2403.11821 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Evaluating Text-to-Image Synthesis: Survey and Taxonomy of Image Quality MetricsSebastian Hartwig , Dominik Engel , Leon Sick , Hannah Kniesel , Tristan Payer , Poonam Poonam , Michael Glöckler , Alex Bäuerle , Timo RopinskiComments: preprint, 20 pages, 2 figures, 1 tableSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: Recent advances in text-to-image synthesis enabled through a combination of language and vision foundation models have led to a proliferation of the tools available and an increased attention to the field. When conducting text-to-image synthesis, a central goal is to ensure that the content between text and image is aligned. As such, there exist numerous evaluation metrics that aim to mimic human judgement. However, it is often unclear which metric to use for evaluating text-to-image synthesis systems as their evaluation is highly nuanced. In this work, we provide a comprehensive overview of existing text-to-image evaluation metrics. Based on our findings, we propose a new taxonomy for categorizing these metrics. Our taxonomy is grounded in the assumption that there are two main quality criteria, namely compositionality and generality, which ideally map to human preferences. Ultimately, we derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.
- [1658] arXiv:2403.11830 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Problem space structural adversarial attacks for Network Intrusion Detection Systems based on Graph Neural NetworksComments: preprint submitted to IEEE TIFS, under reviewSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Machine Learning (ML) algorithms have become increasingly popular for supporting Network Intrusion Detection Systems (NIDS). Nevertheless, extensive research has shown their vulnerability to adversarial attacks, which involve subtle perturbations to the inputs of the models aimed at compromising their performance. Recent proposals have effectively leveraged Graph Neural Networks (GNN) to produce predictions based also on the structural patterns exhibited by intrusions to enhance the detection robustness. However, the adoption of GNN-based NIDS introduces new types of risks. In this paper, we propose the first formalization of adversarial attacks specifically tailored for GNN in network intrusion detection. Moreover, we outline and model the problem space constraints that attackers need to consider to carry out feasible structural attacks in real-world scenarios. As a final contribution, we conduct an extensive experimental campaign in which we launch the proposed attacks against state-of-the-art GNN-based NIDS. Our findings demonstrate the increased robustness of the models against classical feature-based adversarial attacks, while highlighting their susceptibility to structure-based attacks.
- [1659] arXiv:2403.11838 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language ModelsYi Luo , Zhenghao Lin , Yuhao Zhang , Jiashuo Sun , Chen Lin , Chengjin Xu , Xiangdong Su , Yelong Shen , Jian Guo , Yeyun GongComments: Accepted to NAACL 2024 main conferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model with well-aligned datasets generated through the process implemented in the second stage. Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model. We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.
- [1660] arXiv:2403.11841 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline DataSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In real-world scenarios, datasets collected from randomized experiments are often constrained by size, due to limitations in time and budget. As a result, leveraging large observational datasets becomes a more attractive option for achieving high-quality policy learning. However, most existing offline reinforcement learning (RL) methods depend on two key assumptions--unconfoundedness and positivity--which frequently do not hold in observational data contexts. Recognizing these challenges, we propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL). We utilize the mediator variable based on front-door criterion to remove the confounding bias; additionally, we adopt the pessimistic principle to address the distributional shift between the action distributions induced by candidate policies, and the behavior policy that generates the observational data. Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm, by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
- [1661] arXiv:2403.11843 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Fuzzy Rough Choquet Distances for ClassificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces a novel Choquet distance using fuzzy rough set based measures. The proposed distance measure combines the attribute information received from fuzzy rough set theory with the flexibility of the Choquet integral. This approach is designed to adeptly capture non-linear relationships within the data, acknowledging the interplay of the conditional attributes towards the decision attribute and resulting in a more flexible and accurate distance. We explore its application in the context of machine learning, with a specific emphasis on distance-based classification approaches (e.g. k-nearest neighbours). The paper examines two fuzzy rough set based measures that are based on the positive region. Moreover, we explore two procedures for monotonizing the measures derived from fuzzy rough set theory, making them suitable for use with the Choquet integral, and investigate their differences.
- [1662] arXiv:2403.11852 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning with Latent State Inference for Autonomous On-ramp Merging under Observation DelaySubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a novel approach to address the challenging problem of autonomous on-ramp merging, where a self-driving vehicle needs to seamlessly integrate into a flow of vehicles on a multi-lane highway. We introduce the Lane-keeping, Lane-changing with Latent-state Inference and Safety Controller (L3IS) agent, designed to perform the on-ramp merging task safely without comprehensive knowledge about surrounding vehicles' intents or driving styles. We also present an augmentation of this agent called AL3IS that accounts for observation delays, allowing the agent to make more robust decisions in real-world environments with vehicle-to-vehicle (V2V) communication delays. By modeling the unobservable aspects of the environment through latent states, such as other drivers' intents, our approach enhances the agent's ability to adapt to dynamic traffic conditions, optimize merging maneuvers, and ensure safe interactions with other vehicles. We demonstrate the effectiveness of our method through extensive simulations generated from real traffic data and compare its performance with existing approaches. L3IS shows a 99.90% success rate in a challenging on-ramp merging case generated from the real US Highway 101 data. We further perform a sensitivity analysis on AL3IS to evaluate its robustness against varying observation delays, which demonstrates an acceptable performance of 93.84% success rate in 1-second V2V communication delay.
- [1663] arXiv:2403.11865 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Exploring Multi-modal Neural Scene Representations With Applications on Thermal ImagingComments: 24 pages, 14 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: Neural Radiance Fields (NeRFs) quickly evolved as the new de-facto standard for the task of novel view synthesis when trained on a set of RGB images. In this paper, we conduct a comprehensive evaluation of neural scene representations, such as NeRFs, in the context of multi-modal learning. Specifically, we present four different strategies of how to incorporate a second modality, other than RGB, into NeRFs: (1) training from scratch independently on both modalities; (2) pre-training on RGB and fine-tuning on the second modality; (3) adding a second branch; and (4) adding a separate component to predict (color) values of the additional modality. We chose thermal imaging as second modality since it strongly differs from RGB in terms of radiosity, making it challenging to integrate into neural scene representations. For the evaluation of the proposed strategies, we captured a new publicly available multi-view dataset, ThermalMix, consisting of six common objects and about 360 RGB and thermal images in total. We employ cross-modality calibration prior to data capturing, leading to high-quality alignments between RGB and thermal images. Our findings reveal that adding a second branch to NeRF performs best for novel view synthesis on thermal images while also yielding compelling results on RGB. Finally, we also show that our analysis generalizes to other modalities, including near-infrared images and depth maps. Project page: this https URL .
- [1664] arXiv:2403.11879 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Unimodal Multi-Task Fusion for Emotional Mimicry PredictionSubjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: In this study, we propose a methodology for the Emotional Mimicry Intensity (EMI) Estimation task within the context of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our approach leverages the Wav2Vec 2.0 framework, pre-trained on a comprehensive podcast dataset, to extract a broad range of audio features encompassing both linguistic and paralinguistic elements. We enhance feature representation through a fusion technique that integrates individual features with a global mean vector, introducing global contextual insights into our analysis. Additionally, we incorporate a pre-trained valence-arousal-dominance (VAD) module from the Wav2Vec 2.0 model. Our fusion employs a Long Short-Term Memory (LSTM) architecture for efficient temporal analysis of audio data. Utilizing only the provided audio data, our approach demonstrates significant improvements over the established baseline.
- [1665] arXiv:2403.11882 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ReGenNet: Towards Human Action-Reaction SynthesisLiang Xu , Yizhou Zhou , Yichao Yan , Xin Jin , Wenhan Zhu , Fengyun Rao , Xiaokang Yang , Wenjun ZengComments: Accepted by CVPR 2024, Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes.
- [1666] arXiv:2403.11886 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback based Self-CorrectionComments: ACL 2024 under reviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Employing Large Language Models (LLMs) for semantic parsing has achieved remarkable success. However, we find existing methods fall short in terms of reliability and efficiency when hallucinations are encountered. In this paper, we address these challenges with a framework called QueryAgent, which solves a question step-by-step and performs step-wise self-correction. We introduce an environmental feedback-based self-correction method called ERASER. Unlike traditional approaches, ERASER leverages rich environmental feedback in the intermediate steps to perform selective and differentiated self-correction only when necessary. Experimental results demonstrate that QueryAgent notably outperforms all previous few-shot methods using only one example on GrailQA and GraphQ by 7.0 and 15.0 F1. Moreover, our approach exhibits superiority in terms of efficiency, including runtime, query overhead, and API invocation costs. By leveraging ERASER, we further improve another baseline (i.e., AgentBench) by approximately 10 points, revealing the strong transferability of our approach.
- [1667] arXiv:2403.11887 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention ModulesXiangyu Chen , Jing Liu , Ye Wang , Pu Perry Wang , Matthew Brand , Guanghui Wang , Toshiaki Koike-AkinoComments: 33 pages, 29 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Low-rank adaptation (LoRA) and its variants are widely employed in fine-tuning large models, including large language models for natural language processing and diffusion models for computer vision. This paper proposes a generalized framework called SuperLoRA that unifies and extends different LoRA variants, which can be realized under different hyper-parameter settings. Introducing grouping, folding, shuffling, projecting, and tensor factoring, SuperLoRA offers high flexibility compared with other LoRA variants and demonstrates superior performance for transfer learning tasks especially in the extremely few-parameter regimes.
- [1668] arXiv:2403.11894 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From Explainable to Interpretable Deep Learning for Natural Language Processing in Healthcare: How Far from Reality?Comments: This paper has been accepted by Computational and Structural Biotechnology JournalSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep learning (DL) has substantially enhanced natural language processing (NLP) in healthcare research. However, the increasing complexity of DL-based NLP necessitates transparent model interpretability, or at least explainability, for reliable decision-making. This work presents a thorough scoping review of explainable and interpretable DL in healthcare NLP. The term "eXplainable and Interpretable Artificial Intelligence" (XIAI) is introduced to distinguish XAI from IAI. Different models are further categorized based on their functionality (model-, input-, output-based) and scope (local, global). Our analysis shows that attention mechanisms are the most prevalent emerging IAI technique. The use of IAI is growing, distinguishing it from XAI. The major challenges identified are that most XIAI does not explore "global" modelling processes, the lack of best practices, and the lack of systematic evaluation and benchmarks. One important opportunity is to use attention mechanisms to enhance multi-modal XIAI for personalized medicine. Additionally, combining DL with causal logic holds promise. Our discussion encourages the integration of XIAI in Large Language Models (LLMs) and domain-specific smaller models. In conclusion, XIAI adoption in healthcare requires dedicated in-house expertise. Collaboration with domain experts, end-users, and policymakers can lead to ready-to-use XIAI methods across NLP and medical tasks. While challenges exist, XIAI techniques offer a valuable foundation for interpretable NLP algorithms in healthcare.
- [1669] arXiv:2403.11901 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Larimar: Large Language Models with Episodic Memory ControlPayel Das , Subhajit Chaudhury , Elliot Nelson , Igor Melnyk , Sarath Swaminathan , Sihui Dai , Aurélie Lozano , Georgios Kollias , Vijil Chenthamarakshan , Jiří , Navrátil , Soham Dan , Pin-Yu ChenSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Efficient and accurate updating of knowledge stored in Large Language Models (LLMs) is one of the most pressing research challenges today. This paper presents Larimar - a novel, brain-inspired architecture for enhancing LLMs with a distributed episodic memory. Larimar's memory allows for dynamic, one-shot updates of knowledge without the need for computationally expensive re-training or fine-tuning. Experimental results on multiple fact editing benchmarks demonstrate that Larimar attains accuracy comparable to most competitive baselines, even in the challenging sequential editing setup, but also excels in speed - yielding speed-ups of 4-10x depending on the base LLM - as well as flexibility due to the proposed architecture being simple, LLM-agnostic, and hence general. We further provide mechanisms for selective fact forgetting and input context length generalization with Larimar and show their effectiveness.
- [1670] arXiv:2403.11942 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Exploring Facial Expression Recognition through Semi-Supervised Pretraining and Temporal ModelingJun Yu , Zhihong Wei , Zhongpeng Cai , Gongpeng Zhao , Zerui Zhang , Yongqi Wang , Guochen Xie , Jichao Zhu , Wangyuan ZhuSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Facial Expression Recognition (FER) plays a crucial role in computer vision and finds extensive applications across various fields. This paper aims to present our approach for the upcoming 6th Affective Behavior Analysis in-the-Wild (ABAW) competition, scheduled to be held at CVPR2024. In the facial expression recognition task, The limited size of the FER dataset poses a challenge to the expression recognition model's generalization ability, resulting in subpar recognition performance. To address this problem, we employ a semi-supervised learning technique to generate expression category pseudo-labels for unlabeled face data. At the same time, we uniformly sampled the labeled facial expression samples and implemented a debiased feedback learning strategy to address the problem of category imbalance in the dataset and the possible data bias in semi-supervised learning. Moreover, to further compensate for the limitation and bias of features obtained only from static images, we introduced a Temporal Encoder to learn and capture temporal relationships between neighbouring expression image features. In the 6th ABAW competition, our method achieved outstanding results on the official validation set, a result that fully confirms the effectiveness and competitiveness of our proposed method.
- [1671] arXiv:2403.11959 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action CountingComments: Source code: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Video Action Counting (VAC) is crucial in analyzing sports, fitness, and everyday activities by quantifying repetitive actions in videos. However, traditional VAC methods have overlooked the complexity of action repetitions, such as interruptions and the variability in cycle duration. Our research addresses the shortfall by introducing a novel approach to VAC, called Irregular Video Action Counting (IVAC). IVAC prioritizes modeling irregular repetition patterns in videos, which we define through two primary aspects: Inter-cycle Consistency and Cycle-interval Inconsistency. Inter-cycle Consistency ensures homogeneity in the spatial-temporal representations of cycle segments, signifying action uniformity within cycles. Cycle-interval inconsistency highlights the importance of distinguishing between cycle segments and intervals based on their inherent content differences. To encapsulate these principles, we propose a new methodology that includes consistency and inconsistency modules, supported by a unique pull-push loss (P2L) mechanism. The IVAC-P2L model applies a pull loss to promote coherence among cycle segment features and a push loss to clearly distinguish features of cycle segments from interval segments. Empirical evaluations conducted on the RepCount dataset demonstrate that the IVAC-P2L model sets a new benchmark in VAC task performance. Furthermore, the model demonstrates exceptional adaptability and generalization across various video contents, outperforming existing models on two additional datasets, UCFRep and Countix, without the need for dataset-specific optimization. These results confirm the efficacy of our approach in addressing irregular repetitions in videos and pave the way for further advancements in video analysis and understanding.
- [1672] arXiv:2403.11961 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhanced Event-Based Video Reconstruction with Motion CompensationComments: 22 pages, 8 figures (supplementary material included)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep neural networks for event-based video reconstruction often suffer from a lack of interpretability and have high memory demands. A lightweight network called CISTA-LSTC has recently been introduced showing that high-quality reconstruction can be achieved through the systematic design of its architecture. However, its modelling assumption that input signals and output reconstructed frame share the same sparse representation neglects the displacement caused by motion. To address this, we propose warping the input intensity frames and sparse codes to enhance reconstruction quality. A CISTA-Flow network is constructed by integrating a flow network with CISTA-LSTC for motion compensation. The system relies solely on events, in which predicted flow aids in reconstruction and then reconstructed frames are used to facilitate flow estimation. We also introduce an iterative training framework for this combined system. Results demonstrate that our approach achieves state-of-the-art reconstruction accuracy and simultaneously provides reliable dense flow estimation. Furthermore, our model exhibits flexibility in that it can integrate different flow networks, suggesting its potential for further performance enhancement.
- [1673] arXiv:2403.11966 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Informed Spectral Normalized Gaussian Processes for Trajectory PredictionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Prior parameter distributions provide an elegant way to represent prior expert and world knowledge for informed learning. Previous work has shown that using such informative priors to regularize probabilistic deep learning (DL) models increases their performance and data-efficiency. However, commonly used sampling-based approximations for probabilistic DL models can be computationally expensive, requiring multiple inference passes and longer training times. Promising alternatives are compute-efficient last layer kernel approximations like spectral normalized Gaussian processes (SNGPs). We propose a novel regularization-based continual learning method for SNGPs, which enables the use of informative priors that represent prior knowledge learned from previous tasks. Our proposal builds upon well-established methods and requires no rehearsal memory or parameter expansion. We apply our informed SNGP model to the trajectory prediction problem in autonomous driving by integrating prior drivability knowledge. On two public datasets, we investigate its performance under diminishing training data and across locations, and thereby demonstrate an increase in data-efficiency and robustness to location-transfers over non-informed and informed baselines.
- [1674] arXiv:2403.11984 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of TeachingComments: Natural language processing, large language models, generative AI, student evaluations of teaching, codebook generation, qualitative data analysisSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Feedback is a critical aspect of improvement. Unfortunately, when there is a lot of feedback from multiple sources, it can be difficult to distill the information into actionable insights. Consider student evaluations of teaching (SETs), which are important sources of feedback for educators. They can give instructors insights into what worked during a semester. A collection of SETs can also be useful to administrators as signals for courses or entire programs. However, on a large scale as in high-enrollment courses or administrative records over several years, the volume of SETs can render them difficult to analyze. In this paper, we discuss a novel method for analyzing SETs using natural language processing (NLP) and large language models (LLMs). We demonstrate the method by applying it to a corpus of 5,000 SETs from a large public university. We show that the method can be used to extract, embed, cluster, and summarize the SETs to identify the themes they express. More generally, this work illustrates how to use the combination of NLP techniques and LLMs to generate a codebook for SETs. We conclude by discussing the implications of this method for analyzing SETs and other types of student writing in teaching and research settings.
- [1675] arXiv:2403.11996 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph ReasoningSubjects: Machine Learning (cs.LG) ; Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Leveraging generative Artificial Intelligence (AI), we have transformed a dataset comprising 1,000 scientific papers into an ontological knowledge graph. Through an in-depth structural analysis, we have calculated node degrees, identified communities and connectivities, and evaluated clustering coefficients and betweenness centrality of pivotal nodes, uncovering fascinating knowledge architectures. The graph has an inherently scale-free nature, is highly connected, and can be used for graph reasoning by taking advantage of transitive and isomorphic properties that reveal unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, propose never-before-seen material designs, and predict material behaviors. We compute deep node embeddings for combinatorial node similarity ranking for use in a path sampling strategy links dissimilar concepts that have previously not been related. One comparison revealed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. In another example, the algorithm proposed a hierarchical mycelium-based composite based on integrating path sampling with principles extracted from Kandinsky's 'Composition VII' painting. The resulting material integrates an innovative set of concepts that include a balance of chaos/order, adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across science, technology and art, revealing a nuanced ontology of immanence that reveal a context-dependent heterarchical interplay of constituents. Graph-based generative AI achieves a far higher degree of novelty, explorative capacity, and technical detail, than conventional approaches and establishes a widely useful framework for innovation by revealing hidden connections.
- [1676] arXiv:2403.12000 (cross-list from cs.SD) [ pdf , ps , html , other ]
-
Title: Notochord: a Flexible Probabilistic Model for Real-Time MIDI PerformanceComments: 12 pages, 6 figures. Proceedings of the 3rd Conference on AI Music Creativity (2022, September 17)Subjects: Sound (cs.SD) ; Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Abstract: Deep learning-based probabilistic models of musical data are producing increasingly realistic results and promise to enter creative workflows of many kinds. Yet they have been little-studied in a performance setting, where the results of user actions typically ought to feel instantaneous. To enable such study, we designed Notochord, a deep probabilistic model for sequences of structured events, and trained an instance of it on the Lakh MIDI dataset. Our probabilistic formulation allows interpretable interventions at a sub-event level, which enables one model to act as a backbone for diverse interactive musical functions including steerable generation, harmonization, machine improvisation, and likelihood-based interfaces. Notochord can generate polyphonic and multi-track MIDI, and respond to inputs with latency below ten milliseconds. Training code, model checkpoints and interactive examples are provided as open source software.
- [1677] arXiv:2403.12002 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video EditingComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Text-driven diffusion-based video editing presents a unique challenge not encountered in image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match space-time self-similarities of the original video and the edited video during the score distillation. Thanks to the use of score distillation, our approach is model-agnostic, which can be applied for both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearances while accurately preserving the original structure and motion.
- [1678] arXiv:2403.12009 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Leveraging Spatial and Semantic Feature Extraction for Skin Cancer Diagnosis with Capsule Networks and Graph Neural NetworksComments: This is the first version of our paper, we gladly expect feedback and corrections if there is any mistake within our paperSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In the realm of skin lesion image classification, the intricate spatial and semantic features pose significant challenges for conventional Convolutional Neural Network (CNN)-based methodologies. These challenges are compounded by the imbalanced nature of skin lesion datasets, which hampers the ability of models to learn minority class features effectively. Despite augmentation strategies, such as those using Generative Adversarial Networks (GANs), previous attempts have not fully addressed these complexities. This study introduces an innovative approach by integrating Graph Neural Networks (GNNs) with Capsule Networks to enhance classification performance. GNNs, known for their proficiency in handling graph-structured data, offer an advanced mechanism for capturing complex patterns and relationships beyond the capabilities of traditional CNNs. Capsule Networks further contribute by providing superior recognition of spatial hierarchies within images. Our research focuses on evaluating and enhancing the Tiny Pyramid Vision GNN (Tiny Pyramid ViG) architecture by incorporating it with a Capsule Network. This hybrid model was applied to the MNIST:HAM10000 dataset, a comprehensive skin lesion dataset designed for benchmarking classification models. After 75 epochs of training, our model achieved a significant accuracy improvement, reaching 89.23% and 95.52%, surpassing established benchmarks such as GoogLeNet (83.94%), InceptionV3 (86.82%), MobileNet V3 (89.87%), EfficientNet-B7 (92.07%), ResNet18 (92.22%), ResNet34 (91.90%), ViT-Base (73.70%), and IRv2-SA (93.47%) on the same dataset. This outcome underscores the potential of our approach in overcoming the inherent challenges of skin lesion classification, contributing to the advancement of image-based diagnosis in dermatology.
- [1679] arXiv:2403.12010 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: VideoMV: Consistent Multi-View Generation Based on Large Video Generative ModelQi Zuo , Xiaodong Gu , Lingteng Qiu , Yuan Dong , Zhengyi Zhao , Weihao Yuan , Rui Peng , Siyu Zhu , Zilong Dong , Liefeng Bo , Qixing HuangComments: Project page: this http URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training, we propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video data sets used to train these models are abundant and diverse, leading to a reduced train-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to get an explicit global 3D model, and then adopts a sampling strategy that effectively involves images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is this http URL .
- [1680] arXiv:2403.12014 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EnvGen: Generating and Adapting Environments via LLMs for Training Embodied AgentsComments: First two authors contributed equally; Project website: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at? We propose EnvGen, a novel framework to address this question. First, we prompt an LLM to generate training environments that allow agents to quickly learn different tasks in parallel. Concretely, the LLM is given the task description and simulator objectives that the agents should learn and is then asked to generate a set of environment configurations (e.g., different terrains, items given to agents, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, we enable the LLM to continuously adapt the generated environments to progressively improve the skills that the agent is weak at, by providing feedback to the LLM in the form of the agent's performance. We demonstrate the usefulness of EnvGen with comprehensive experiments in Crafter and Heist environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We show qualitatively how the LLM adapts training environments to help improve RL agents' weaker skills over time. Additionally, EnvGen is substantially more efficient as it only uses a small number of LLM calls (e.g., 4 in total), whereas LLM agents require thousands of LLM calls. Lastly, we present detailed ablation studies for our design choices.
- [1681] arXiv:2403.12017 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Supervised Fine-Tuning as Inverse Reinforcement LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The prevailing approach to aligning Large Language Models (LLMs) typically relies on human or AI feedback and assumes access to specific types of preference datasets. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic. We build a sequential decision-making framework to formulate the problem of aligning LLMs using demonstration datasets. Drawing insights from inverse reinforcement learning and imitation learning, we introduce various approaches for divergence minimization in the LLM alignment tasks. Our analysis highlights the mass-covering and mode-seeking behaviors of these different approaches. Inclusively, we examine the pros and cons of the classical supervised fine-tuning method, elaborating on scenarios where different methods shine.
- [1682] arXiv:2403.12026 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FlexCap: Generating Rich, Localized, and Flexible Captions in ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications.
First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL . - [1683] arXiv:2403.12027 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation ModelsKung-Hsiang Huang , Hou Pong Chan , Yi R. Fung , Haoyi Qiu , Mingyang Zhou , Shafiq Joty , Shih-Fu Chang , Heng JiSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore various tasks and their evaluation metrics and sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed, highlighting the importance of several topics, such as domain-specific charts, lack of efforts in developing evaluation metrics, and agent-oriented settings. This survey paper serves as a comprehensive resource for researchers and practitioners in the fields of natural language processing, computer vision, and data analysis, providing valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: this https URL .
- [1684] arXiv:2403.12028 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and DetailComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: 3D human body reconstruction has been a challenge in the field of computer vision. Previous methods are often time-consuming and difficult to capture the detailed appearance of the human body. In this paper, we propose a new method called \emph{Ultraman} for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, \emph{Ultraman} greatly improves the reconstruction speed and accuracy while preserving high-quality texture details. We present a set of new frameworks for human reconstruction consisting of three parts, geometric reconstruction, texture generation and texture mapping. Firstly, a mesh reconstruction framework is used, which accurately extracts 3D human shapes from a single image. At the same time, we propose a method to generate a multi-view consistent image of the human body based on a single image. This is finally combined with a novel texture mapping method to optimize texture details and ensure color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of \emph{Ultraman} on various standard datasets. In addition, \emph{Ultraman} outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available.
- [1685] arXiv:2403.12029 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Align and Distill: Unifying and Improving Domain Adaptive Object DetectionJustin Kay , Timm Haucke , Suzanne Stathatos , Siqi Deng , Erik Young , Pietro Perona , Sara Beery , Grant Van HornComments: 30 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Object detectors often perform poorly on data that differs from their training set. Domain adaptive object detection (DAOD) methods have recently demonstrated strong results on addressing this challenge. Unfortunately, we identify systemic benchmarking pitfalls that call past results into question and hamper further progress: (a) Overestimation of performance due to underpowered baselines, (b) Inconsistent implementation practices preventing transparent comparisons of methods, and (c) Lack of generality due to outdated backbones and lack of diversity in benchmarks. We address these problems by introducing: (1) A unified benchmarking and implementation framework, Align and Distill (ALDI), enabling comparison of DAOD methods and supporting future development, (2) A fair and modern training and evaluation protocol for DAOD that addresses benchmarking pitfalls, (3) A new DAOD benchmark dataset, CFC-DAOD, enabling evaluation on diverse real-world data, and (4) A new method, ALDI++, that achieves state-of-the-art results by a large margin. ALDI++ outperforms the previous state-of-the-art by +3.5 AP50 on Cityscapes to Foggy Cityscapes, +5.7 AP50 on Sim10k to Cityscapes (where ours is the only method to outperform a fair baseline), and +2.0 AP50 on CFC Kenai to Channel. Our framework, dataset, and state-of-the-art method offer a critical reset for DAOD and provide a strong foundation for future research. Code and data are available: this https URL and this https URL .
- [1686] arXiv:2403.12031 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: RouterBench: A Benchmark for Multi-LLM Routing SystemQitian Jason Hu , Jacob Bieker , Xiuyu Li , Nan Jiang , Benjamin Keigwin , Gaurav Ranganath , Kurt Keutzer , Shriyash Kaustubh UpadhyaySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet, the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing, and deliver a comparative analysis of various routing approaches through RouterBench, highlighting their potentials and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at this https URL .
- [1687] arXiv:2403.12055 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Deep learning based detection of collateral circulation in coronary angiographiesCosmin-Andrei Hatfaludi , Daniel Bunescu , Costin Florian Ciusdel , Alex Serban , Karl Bose , Marc Oppel , Stephanie Schroder , Christopher Seehase , Harald F. Langer , Jeanette Erdmann , Henry Nording , Lucian Mihai ItuSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Coronary artery disease (CAD) is the dominant cause of death and hospitalization across the globe. Atherosclerosis, an inflammatory condition that gradually narrows arteries and has potentially fatal effects, is the most frequent cause of CAD. Nonetheless, the circulation regularly adapts in the presence of atherosclerosis, through the formation of collateral arteries, resulting in significant long-term health benefits. Therefore, timely detection of coronary collateral circulation (CCC) is crucial for CAD personalized medicine. We propose a novel deep learning based method to detect CCC in angiographic images. Our method relies on a convolutional backbone to extract spatial features from each frame of an angiography sequence. The features are then concatenated, and subsequently processed by another convolutional layer that processes embeddings temporally. Due to scarcity of data, we also experiment with pretraining the backbone on coronary artery segmentation, which improves the results consistently. Moreover, we experiment with few-shot learning to further improve performance, given our low data regime. We present our results together with subgroup analyses based on Rentrop grading, collateral flow, and collateral grading, which provide valuable insights into model performance. Overall, the proposed method shows promising results in detecting CCC, and can be further extended to perform landmark based CCC detection and CCC quantification.
- [1688] arXiv:2403.12058 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Water-Based Metaheuristics: How Water Dynamics Can Help Us to Solve NP-Hard ProblemsComments: 14 pages, 0 figures, published in journal Complexity, 2019Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Many water-based optimization metaheuristics have been introduced during the last decade, both for combinatorial and for continuous optimization. Despite the strong similarities of these methods in terms of their underlying natural metaphors (most of them emulate, in some way or another, how drops collaboratively form paths down to the sea), in general the resulting algorithms are quite different in terms of their searching approach or their solution construction approach. For instance, each entity may represent a solution by itself or, alternatively, entities may construct solutions by modifying the landscape while moving. A researcher or practitioner could assume that the degree of similarity between two water-based metaheuristics heavily depends on the similarity of the natural water mechanics they emulate, but this is not the case. In order to bring some clarity to this mosaic of apparently related metaheuristics, in this paper we introduce them, explain their mechanics, and highlight their differences.
- [1689] arXiv:2403.12061 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Design-Space Exploration of SNN Models using Application-Specific Multi-Core ArchitecturesComments: Abstract Presentation in 2023 Neuro-Inspired Computing Elements (NICE) ConferenceSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: With the motivation and the difficulties that currently exist in comprehending and utilizing the promising features of SNNs, we proposed a novel run-time multi-core architecture-based simulator called "RAVSim" (Runtime Analysis and Visualization Simulator), a cutting-edge SNN simulator, developed using LabVIEW and it is publicly available on their website as an official module. RAVSim is a runtime virtual simulation environment tool that enables the user to interact with the model, observe its behavior of output concentration, and modify the set of parametric values at any time while the simulation is in execution. Recently some popular tools have been presented, but we believe that none of the tools allow users to interact with the model simulation in run time.
- [1690] arXiv:2403.12069 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Fairness Evaluation for Uplift Modeling in the Absence of Ground TruthComments: IEEE International Conference on Machine Learning and Applications (IEEE ICMLA)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The acceleration in the adoption of AI-based automated decision-making systems poses a challenge for evaluating the fairness of algorithmic decisions, especially in the absence of ground truth. When designing interventions, uplift modeling is used extensively to identify candidates that are likely to benefit from treatment. However, these models remain particularly susceptible to fairness evaluation due to the lack of ground truth on the outcome measure since a candidate cannot be in both treatment and control simultaneously. In this article, we propose a framework that overcomes the missing ground truth problem by generating surrogates to serve as a proxy for counterfactual labels of uplift modeling campaigns. We then leverage the surrogate ground truth to conduct a more comprehensive binary fairness evaluation. We show how to apply the approach in a comprehensive study from a real-world marketing campaign for promotional offers and demonstrate its enhancement for fairness evaluation.
- [1691] arXiv:2403.12071 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Tailoring Education with GenAI: A New Horizon in Lesson PlanningComments: Abstract accepted for EDUCON 2024 (IEEE Global Engineering Education Conference 2024)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The advent of Generative AI (GenAI) in education presents a transformative approach to traditional teaching methodologies, which often overlook the diverse needs of individual students. This study introduces a GenAI tool, based on advanced natural language processing, designed as a digital assistant for educators, enabling the creation of customized lesson plans. The tool utilizes an innovative feature termed 'interactive mega-prompt,' a comprehensive query system that allows educators to input detailed classroom specifics such as student demographics, learning objectives, and preferred teaching styles. This input is then processed by the GenAI to generate tailored lesson plans. To evaluate the tool's effectiveness, a comprehensive methodology incorporating both quantitative (i.e., % of time savings) and qualitative (i.e., user satisfaction) criteria was implemented, spanning various subjects and educational levels, with continuous feedback collected from educators through a structured evaluation form. Preliminary results show that educators find the GenAI-generated lesson plans effective, significantly reducing lesson planning time and enhancing the learning experience by accommodating diverse student needs. This AI-driven approach signifies a paradigm shift in education, suggesting its potential applicability in broader educational contexts, including special education needs (SEN), where individualized attention and specific learning aids are paramount
- [1692] arXiv:2403.12073 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Feasibility of Social-Network-Based eHealth Intervention on the Improvement of Healthy Habits among ChildrenJosé Alberto Benítez-Andrades , Natalia Arias , María Teresa García-Ordás , Marta Martínez-Martínez , Isaías García-RodríguezJournal-ref: Sensors 2020, 20(5), 1404Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: This study shows the feasibility of an eHealth solution for tackling eating habits and physical activity in the adolescent population. The participants were children from 11 to 15 years old. An intervention was carried out on 139 students in the intervention group and 91 students in the control group, in two schools during 14 weeks. The intervention group had access to the web through a user account and a password. They were able to create friendship relationships, post comments, give likes and interact with other users, as well as receive notifications and information about nutrition and physical activity on a daily basis and get (virtual) rewards for improving their habits. The control group did not have access to any of these features. The homogeneity of the samples in terms of gender, age, body mass index and initial health-related habits was demonstrated. Pre- and post-measurements were collected through self-reports on the application website. After applying multivariate analysis of variance, a significant alteration in the age-adjusted body mass index percentile was observed in the intervention group versus the control group, as well as in the PAQ-A score and the KIDMED score. It can be concluded that eHealth interventions can help to obtain healthy habits. More research is needed to examine the effectiveness in achieving adherence to these new habits.
- [1693] arXiv:2403.12075 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image GenerationJessica Quaye , Alicia Parrish , Oana Inel , Charvi Rastogi , Hannah Rose Kirk , Minsuk Kahng , Erin van Liemt , Max Bartolo , Jess Tsang , Justin White , Nathan Clement , Rafael Mosquera , Juan Ciro , Vijay Janapa Reddi , Lora AroyoComments: 15 pages, 6 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies and discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as ``safe'' by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models. - [1694] arXiv:2403.12076 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Neuron-centric Hebbian LearningComments: Accepted at Genetic and Evolutionary Computation Conference (GECCO 2024)Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: One of the most striking capabilities behind the learning mechanisms of the brain is the adaptation, through structural and functional plasticity, of its synapses. While synapses have the fundamental role of transmitting information across the brain, several studies show that it is the neuron activations that produce changes on synapses. Yet, most plasticity models devised for artificial Neural Networks (NNs), e.g., the ABCD rule, focus on synapses, rather than neurons, therefore optimizing synaptic-specific Hebbian parameters. This approach, however, increases the complexity of the optimization process since each synapse is associated to multiple Hebbian parameters. To overcome this limitation, we propose a novel plasticity model, called Neuron-centric Hebbian Learning (NcHL), where optimization focuses on neuron- rather than synaptic-specific Hebbian parameters. Compared to the ABCD rule, NcHL reduces the parameters from $5W$ to $5N$, being $W$ and $N$ the number of weights and neurons, and usually $N \ll W$. We also devise a ``weightless'' NcHL model, which requires less memory by approximating the weights based on a record of neuron activations. Our experiments on two robotic locomotion tasks reveal that NcHL performs comparably to the ABCD rule, despite using up to $\sim97$ times less parameters, thus allowing for scalable plasticity
- [1695] arXiv:2403.12077 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Evaluating Robustness of Generative Search Engine on Adversarial Factual QuestionsXuming Hu , Xiaochuan Li , Junzhe Chen , Yinghui Li , Yangning Li , Xiaoguang Li , Yasheng Wang , Qun Liu , Lijie Wen , Philip S. Yu , Zhijiang GuoComments: 21 pages, 7 figures, 4 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Generative search engines have the potential to transform how people seek information online, but generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in the realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment.
- [1696] arXiv:2403.12082 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Boy Who Survived: Removing Harry Potter from an LLM is harder than reportedComments: 2 pages, 4 pages of appendix. Comment on arXiv:2310.02238Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent work arXiv.2310.02238 asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content.'' This claim is shown to be overbroad. A small experiment of less than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett...''
- [1697] arXiv:2403.12090 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: Foundation Models and Information Retrieval in Digital PathologyComments: This is the preprint of a book chapter to appear in "Artificial Intelligence in Pathology" by Stanley Cohen and Chhavi ChauhanSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: The paper reviews the state-of-the-art of foundation models, LLMs, generative AI, information retrieval and CBIR in digital pathology
- [1698] arXiv:2403.12092 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Methods for Matching English Language AddressesSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Addresses occupy a niche location within the landscape of textual data, due to the positional importance carried by every word, and the geographical scope it refers to. The task of matching addresses happens everyday and is present in various fields like mail redirection, entity resolution, etc. Our work defines, and formalizes a framework to generate matching and mismatching pairs of addresses in the English language, and use it to evaluate various methods to automatically perform address matching. These methods vary widely from distance based approaches to deep learning models. By studying the Precision, Recall and Accuracy metrics of these approaches, we obtain an understanding of the best suited method for this setting of the address matching task.
- [1699] arXiv:2403.12093 (cross-list from econ.TH) [ pdf , ps , html , other ]
-
Title: Learning Macroeconomic Policies based on Microfoundations: A Stackelberg Mean Field Game ApproachComments: 15 pages, 7 figures, 3 tablesSubjects: Theoretical Economics (econ.TH) ; Artificial Intelligence (cs.AI)
Abstract: Effective macroeconomic policies play a crucial role in promoting economic growth and social stability. This paper models the optimal macroeconomic policy problem based on the \textit{Stackelberg Mean Field Game} (SMFG), where the government acts as the leader in policy-making, and large-scale households dynamically respond as followers. This modeling method captures the asymmetric dynamic game between the government and large-scale households, and interpretably evaluates the effects of macroeconomic policies based on microfoundations, which is difficult for existing methods to achieve. We also propose a solution for SMFGs, incorporating pre-training on real data and a model-free \textit{Stackelberg mean-field reinforcement learning }(SMFRL) algorithm, which operates independently of prior environmental knowledge and transitions. Our experimental results showcase the superiority of the SMFG method over other economic policies in terms of performance, efficiency-equity tradeoff, and SMFG assumption analysis. This paper significantly contributes to the domain of AI for economics by providing a powerful tool for modeling and solving optimal macroeconomic policies.
- [1700] arXiv:2403.12096 (cross-list from cs.IR) [ pdf , ps , other ]
-
Title: Enriching User Shopping History: Empowering E-commerce with a Hierarchical Recommendation SystemSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Recommendation systems can provide accurate recommendations by analyzing user shopping history. A richer user history results in more accurate recommendations. However, in real applications, users prefer e-commerce platforms where the item they seek is at the lowest price. In other words, most users shop from multiple e-commerce platforms simultaneously; different parts of the user's shopping history are shared between different e-commerce platforms. Consequently, we assume in this study that any e-commerce platform has a complete record of the user's history but can only access some parts of it. If a recommendation system is able to predict the missing parts first and enrich the user's shopping history properly, it will be possible to recommend the next item more accurately. Our recommendation system leverages user shopping history to improve prediction accuracy. The proposed approach shows significant improvements in both NDCG@10 and HR@10.
- [1701] arXiv:2403.12098 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Deep Generative Design for Mass ProductionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: Generative Design (GD) has evolved as a transformative design approach, employing advanced algorithms and AI to create diverse and innovative solutions beyond traditional constraints. Despite its success, GD faces significant challenges regarding the manufacturability of complex designs, often necessitating extensive manual modifications due to limitations in standard manufacturing processes and the reliance on additive manufacturing, which is not ideal for mass production. Our research introduces an innovative framework addressing these manufacturability concerns by integrating constraints pertinent to die casting and injection molding into GD, through the utilization of 2D depth images. This method simplifies intricate 3D geometries into manufacturable profiles, removing unfeasible features such as non-manufacturable overhangs and allowing for the direct consideration of essential manufacturing aspects like thickness and rib design. Consequently, designs previously unsuitable for mass production are transformed into viable solutions. We further enhance this approach by adopting an advanced 2D generative model, which offer a more efficient alternative to traditional 3D shape generation methods. Our results substantiate the efficacy of this framework, demonstrating the production of innovative, and, importantly, manufacturable designs. This shift towards integrating practical manufacturing considerations into GD represents a pivotal advancement, transitioning from purely inspirational concepts to actionable, production-ready solutions. Our findings underscore usefulness and potential of GD for broader industry adoption, marking a significant step forward in aligning GD with the demands of manufacturing challenges.
- [1702] arXiv:2403.12100 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Learning Time Slot Preferences via Mobility Tree for Next POI RecommendationSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Next Point-of-Interests (POIs) recommendation task aims to provide a dynamic ranking of POIs based on users' current check-in trajectories. The recommendation performance of this task is contingent upon a comprehensive understanding of users' personalized behavioral patterns through Location-based Social Networks (LBSNs) data. While prior studies have adeptly captured sequential patterns and transitional relationships within users' check-in trajectories, a noticeable gap persists in devising a mechanism for discerning specialized behavioral patterns during distinct time slots, such as noon, afternoon, or evening. In this paper, we introduce an innovative data structure termed the ``Mobility Tree'', tailored for hierarchically describing users' check-in records. The Mobility Tree encompasses multi-granularity time slot nodes to learn user preferences across varying temporal periods. Meanwhile, we propose the Mobility Tree Network (MTNet), a multitask framework for personalized preference learning based on Mobility Trees. We develop a four-step node interaction operation to propagate feature information from the leaf nodes to the root node. Additionally, we adopt a multitask training strategy to push the model towards learning a robust representation. The comprehensive experimental results demonstrate the superiority of MTNet over ten state-of-the-art next POI recommendation models across three real-world LBSN datasets, substantiating the efficacy of time slot preference learning facilitated by Mobility Tree.
- [1703] arXiv:2403.12107 (cross-list from econ.GN) [ pdf , ps , html , other ]
-
Title: Scenarios for the Transition to AGISubjects: General Economics (econ.GN) ; Artificial Intelligence (cs.AI)
Abstract: We analyze how output and wages behave under different scenarios for technological progress that may culminate in Artificial General Intelligence (AGI), defined as the ability of AI systems to perform all tasks that humans can perform. We assume that human work can be decomposed into atomistic tasks that differ in their complexity. Advances in technology make ever more complex tasks amenable to automation. The effects on wages depend on a race between automation and capital accumulation. If the distribution of task complexity exhibits a sufficiently thick infinite tail, then there is always enough work for humans, and wages may rise forever. By contrast, if the complexity of tasks that humans can perform is bounded and full automation is reached, then wages collapse. But declines may occur even before if large-scale automation outpaces capital accumulation and makes labor too abundant. Automating productivity growth may lead to broad-based gains in the returns to all factors. By contrast, bottlenecks to growth from irreproducible scarce factors may exacerbate the decline in wages.
- [1704] arXiv:2403.12109 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: GCAM: Gaussian and causal-attention model of food fine-grained recognitionComments: 23 pages, 11 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Currently, most food recognition relies on deep learning for category classification. However, these approaches struggle to effectively distinguish between visually similar food samples, highlighting the pressing need to address fine-grained issues in food recognition. To mitigate these challenges, we propose the adoption of a Gaussian and causal-attention model for fine-grained object this http URL particular, we train to obtain Gaussian features over target regions, followed by the extraction of fine-grained features from the objects, thereby enhancing the feature mapping capabilities of the target regions. To counteract data drift resulting from uneven data distributions, we employ a counterfactual reasoning approach. By using counterfactual interventions, we analyze the impact of the learned image attention mechanism on network predictions, enabling the network to acquire more useful attention weights for fine-grained image recognition. Finally, we design a learnable loss strategy to balance training stability across various modules, ultimately improving the accuracy of the final target recognition. We validate our approach on four relevant datasets, demonstrating its excellent performance across these four datasets.We experimentally show that GCAM surpasses state-of-the-art methods on the ETH-FOOD101, UECFOOD256, and Vireo-FOOD172 datasets. Furthermore, our approach also achieves state-of-the-art performance on the CUB-200 dataset.
- [1705] arXiv:2403.12114 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Safety Analysis of Autonomous Railway Systems: An Introduction to the SACRED MethodologyComments: S. Bernardi, T. Zoppi (Editors), "Fast Abstracts and Student Forum Proceedings - EDCC 2024 - 19th European Dependable Computing Conference, Leuven, Belgium, 8-11 April 2024"Subjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: As the railway industry increasingly seeks to introduce autonomy and machine learning (ML), several questions arise. How can safety be assured for such systems and technologies? What is the applicability of current safety standards within this new technological landscape? What are the key metrics to classify a system as safe? Currently, safety analysis for the railway reflects the failure modes of existing technology; in contrast, the primary concern of analysis of automation is typically average performance. Such purely statistical approaches to measuring ML performance are limited, as they may overlook classes of situations that may occur rarely but in which the function performs consistently poorly. To combat these difficulties we introduce SACRED, a safety methodology for producing an initial safety case and determining important safety metrics for autonomous systems. The development of SACRED is motivated by the proposed GoA-4 light-rail system in Berlin.
- [1706] arXiv:2403.12143 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Graph Neural Networks for Learning Equivariant Representations of Neural NetworksMiltiadis Kofinas , Boris Knyazev , Yan Zhang , Yunlu Chen , Gertjan J. Burghouts , Efstratios Gavves , Cees G. M. Snoek , David W. ZhangComments: In ICLR 2024. Source code: this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. The source code is open-sourced at this https URL .
- [1707] arXiv:2403.12171 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: EasyJailbreak: A Unified Framework for Jailbreaking Large Language ModelsWeikang Zhou , Xiao Wang , Limao Xiong , Han Xia , Yingshuang Gu , Mingxu Chai , Fukang Zhu , Caishuang Huang , Shihan Dou , Zhiheng Xi , Rui Zheng , Songyang Gao , Yicheng Zou , Hang Yan , Yifan Le , Ruohui Wang , Lijun Li , Jing Shao , Tao Gui , Qi Zhang , Xuanjing HuangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.
- [1708] arXiv:2403.12172 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly DetectionComments: 18 pages, 2 figures, 6 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Skeleton-based video anomaly detection (SVAD) is a crucial task in computer vision. Accurately identifying abnormal patterns or events enables operators to promptly detect suspicious activities, thereby enhancing safety. Achieving this demands a comprehensive understanding of human motions, both at body and region levels, while also accounting for the wide variations of performing a single action. However, existing studies fail to simultaneously address these crucial properties. This paper introduces a novel, practical and lightweight framework, namely Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection (GiCiSAD) to overcome the challenges associated with SVAD. GiCiSAD consists of three novel modules: the Graph Attention-based Forecasting module to capture the spatio-temporal dependencies inherent in the data, the Graph-level Jigsaw Puzzle Maker module to distinguish subtle region-level discrepancies between normal and abnormal motions, and the Graph-based Conditional Diffusion model to generate a wide spectrum of human motions. Extensive experiments on four widely used skeleton-based video datasets show that GiCiSAD outperforms existing methods with significantly fewer training parameters, establishing it as the new state-of-the-art.
- [1709] arXiv:2403.12173 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: TnT-LLM: Text Mining at Scale with Large Language ModelsMengting Wan , Tara Safavi , Sujay Kumar Jauhar , Yujin Kim , Scott Counts , Jennifer Neville , Siddharth Suri , Chirag Shah , Ryen W White , Longqi Yang , Reid Andersen , Georg Buscher , Dhruv Joshi , Nagu RanganComments: 9 pages main content, 8 pages references and appendixSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
- [1710] arXiv:2403.12176 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Safety Implications of Explainable Artificial Intelligence in End-to-End Autonomous DrivingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: The end-to-end learning pipeline is gradually creating a paradigm shift in the ongoing development of highly autonomous vehicles, largely due to advances in deep learning, the availability of large-scale training datasets, and improvements in integrated sensor devices. However, a lack of interpretability in real-time decisions with contemporary learning methods impedes user trust and attenuates the widespread deployment and commercialization of such vehicles. Moreover, the issue is exacerbated when these cars are involved in or cause traffic accidents. Such drawback raises serious safety concerns from societal and legal perspectives. Consequently, explainability in end-to-end autonomous driving is essential to build trust in vehicular automation. However, the safety and explainability aspects of end-to-end driving have generally been investigated disjointly by researchers in today's state of the art. This survey aims to bridge the gaps between these topics and seeks to answer the following research question: When and how can explanations improve safety of end-to-end autonomous driving? In this regard, we first revisit established safety and state-of-the-art explainability techniques in end-to-end driving. Furthermore, we present three critical case studies and show the pivotal role of explanations in enhancing self-driving safety. Finally, we describe insights from empirical studies and reveal potential value, limitations, and caveats of practical explainable AI methods with respect to their safety assurance in end-to-end autonomous driving.
- [1711] arXiv:2403.12181 (cross-list from cs.GT) [ pdf , ps , html , other ]
-
Title: MAC Advice for Facility Location Mechanism DesignSubjects: Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI)
Abstract: Algorithms with predictions have attracted much attention in the last years across various domains, including variants of facility location, as a way to surpass traditional worst-case analyses. We study the $k$-facility location mechanism design problem, where the $n$ agents are strategic and might misreport their location.
Unlike previous models, where predictions are for the $k$ optimal facility locations, we receive $n$ predictions for the locations of each of the agents. However, these predictions are only "mostly" and "approximately" correct (or MAC for short) -- i.e., some $\delta$-fraction of the predicted locations are allowed to be arbitrarily incorrect, and the remainder of the predictions are allowed to be correct up to an $\varepsilon$-error. We make no assumption on the independence of the errors. Can such predictions allow us to beat the current best bounds for strategyproof facility location?
We show that the $1$-median (geometric median) of a set of points is naturally robust under corruptions, which leads to an algorithm for single-facility location with MAC predictions. We extend the robustness result to a "balanced" variant of the $k$ facilities case. Without balancedness, we show that robustness completely breaks down, even for the setting of $k=2$ facilities on a line. For this "unbalanced" setting, we devise a truthful random mechanism that outperforms the best known result of Lu et al. [2010], which does not use predictions. En route, we introduce the problem of "second" facility location (when the first facility's location is already fixed). Our findings on the robustness of the $1$-median and more generally $k$-medians may be of independent interest, as quantitative versions of classic breakdown-point results in robust statistics. - [1712] arXiv:2403.12196 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Shifting the Lens: Detecting Malware in npm Ecosystem with Large Language ModelsComments: 13 pages, 1 Figure, 7 tablesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The Gartner 2022 report predicts that 45% of organizations worldwide will encounter software supply chain attacks by 2025, highlighting the urgency to improve software supply chain security for community and national interests. Current malware detection techniques aid in the manual review process by filtering benign and malware packages, yet such techniques have high false-positive rates and limited automation support. Therefore, malware detection techniques could benefit from advanced, more automated approaches for accurate and minimally false-positive results. The goal of this study is to assist security analysts in identifying malicious packages through the empirical study of large language models (LLMs) to detect potential malware in the npm ecosystem.
We present SocketAI Scanner, a multi-stage decision-maker malware detection workflow using iterative self-refinement and zero-shot-role-play-Chain of Thought (CoT) prompting techniques for ChatGPT. We studied 5,115 npm packages (of which 2,180 are malicious) and performed a baseline comparison of the GPT-3 and GPT-4 models with a static analysis tool. Our findings showed promising results for GPT models with low misclassification alert rates. Our baseline comparison demonstrates a notable improvement over static analysis in precision scores above 25% and F1 scores above 15%. We attained precision and F1 scores of 91% and 94%, respectively, for the GPT-3 model. Overall, GPT-4 demonstrates superior performance in precision (99%) and F1 (97%) scores, while GPT-3 presents a cost-effective balance between performance and expenditure. - [1713] arXiv:2403.12197 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: E2F-Net: Eyes-to-Face Inpainting via StyleGAN Latent SpaceAhmad Hassanpour , Fatemeh Jamalbafrani , Bian Yang , Kiran Raja , Raymond Veldhuis , Julian FierrezSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Face inpainting, the technique of restoring missing or damaged regions in facial images, is pivotal for applications like face recognition in occluded scenarios and image analysis with poor-quality captures. This process not only needs to produce realistic visuals but also preserve individual identity characteristics. The aim of this paper is to inpaint a face given periocular region (eyes-to-face) through a proposed new Generative Adversarial Network (GAN)-based model called Eyes-to-Face Network (E2F-Net). The proposed approach extracts identity and non-identity features from the periocular region using two dedicated encoders have been used. The extracted features are then mapped to the latent space of a pre-trained StyleGAN generator to benefit from its state-of-the-art performance and its rich, diverse and expressive latent space without any additional training. We further improve the StyleGAN output to find the optimal code in the latent space using a new optimization for GAN inversion technique. Our E2F-Net requires a minimum training process reducing the computational complexity as a secondary benefit. Through extensive experiments, we show that our method successfully reconstructs the whole face with high quality, surpassing current techniques, despite significantly less training and supervision efforts. We have generated seven eyes-to-face datasets based on well-known public face datasets for training and verifying our proposed methods. The code and datasets are publicly available.
- [1714] arXiv:2403.12207 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Synthetic Image Generation in Cyber Influence Operations: An Emergent Threat?Comments: 44 pages, 56 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The evolution of artificial intelligence (AI) has catalyzed a transformation in digital content generation, with profound implications for cyber influence operations. This report delves into the potential and limitations of generative deep learning models, such as diffusion models, in fabricating convincing synthetic images. We critically assess the accessibility, practicality, and output quality of these tools and their implications in threat scenarios of deception, influence, and subversion. Notably, the report generates content for several hypothetical cyber influence operations to demonstrate the current capabilities and limitations of these AI-driven methods for threat actors. While generative models excel at producing illustrations and non-realistic imagery, creating convincing photo-realistic content remains a significant challenge, limited by computational resources and the necessity for human-guided refinement. Our exploration underscores the delicate balance between technological advancement and its potential for misuse, prompting recommendations for ongoing research, defense mechanisms, multi-disciplinary collaboration, and policy development. These recommendations aim to leverage AI's potential for positive impact while safeguarding against its risks to the integrity of information, especially in the context of cyber influence.
- [1715] arXiv:2403.12211 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Unified Model for Longitudinal Multi-Modal Multi-View Prediction with MissingnessSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Medical records often consist of different modalities, such as images, text, and tabular information. Integrating all modalities offers a holistic view of a patient's condition, while analyzing them longitudinally provides a better understanding of disease progression. However, real-world longitudinal medical records present challenges: 1) patients may lack some or all of the data for a specific timepoint, and 2) certain modalities or views might be absent for all patients during a particular period. In this work, we introduce a unified model for longitudinal multi-modal multi-view prediction with missingness. Our method allows as many timepoints as desired for input, and aims to leverage all available data, regardless of their availability. We conduct extensive experiments on the knee osteoarthritis dataset from the Osteoarthritis Initiative for pain and Kellgren-Lawrence grade prediction at a future timepoint. We demonstrate the effectiveness of our method by comparing results from our unified model to specific models that use the same modality and view combinations during training and evaluation. We also show the benefit of having extended temporal data and provide post-hoc analysis for a deeper understanding of each modality/view's importance for different tasks.
- [1716] arXiv:2403.12212 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Evaluating Named Entity Recognition: Comparative Analysis of Mono- and Multilingual Transformer Models on Brazilian Corporate Earnings Call TranscriptionsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Named Entity Recognition (NER) is a Natural Language Processing technique for extracting information from textual documents. However, much of the existing research on NER has been centered around English-language documents, leaving a gap in the availability of datasets tailored to the financial domain in Portuguese. This study addresses the need for NER within the financial domain, focusing on Portuguese-language texts extracted from earnings call transcriptions of Brazilian banks. By curating a comprehensive dataset comprising 384 transcriptions and leveraging weak supervision techniques for annotation, we evaluate the performance of monolingual models trained on Portuguese (BERTimbau and PTT5) and multilingual models (mBERT and mT5). Notably, we introduce a novel approach that reframes the token classification task as a text generation problem, enabling fine-tuning and evaluation of T5 models. Following the fine-tuning of the models, we conduct an evaluation on the test dataset, employing performance and error metrics. Our findings reveal that BERT-based models consistently outperform T5-based models. Furthermore, while the multilingual models exhibit comparable macro F1-scores, BERTimbau demonstrates superior performance over PTT5. A manual analysis of sentences generated by PTT5 and mT5 unveils a degree of similarity ranging from 0.89 to 1.0, between the original and generated sentences. However, critical errors emerge as both models exhibit discrepancies, such as alterations to monetary and percentage values, underscoring the importance of accuracy and consistency in the financial domain. Despite these challenges, PTT5 and mT5 achieve impressive macro F1-scores of 98.52% and 98.85%, respectively, with our proposed approach. Furthermore, our study sheds light on notable disparities in memory and time consumption for inference across the models.
- [1717] arXiv:2403.12237 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Efficient Transformer-based Hyper-parameter Optimization for Resource-constrained IoT EnvironmentsComments: 7 pages, Submitted to IEEE Internet of Things MagazineSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The hyper-parameter optimization (HPO) process is imperative for finding the best-performing Convolutional Neural Networks (CNNs). The automation process of HPO is characterized by its sizable computational footprint and its lack of transparency; both important factors in a resource-constrained Internet of Things (IoT) environment. In this paper, we address these problems by proposing a novel approach that combines transformer architecture and actor-critic Reinforcement Learning (RL) model, TRL-HPO, equipped with multi-headed attention that enables parallelization and progressive generation of layers. These assumptions are founded empirically by evaluating TRL-HPO on the MNIST dataset and comparing it with state-of-the-art approaches that build CNN models from scratch. The results show that TRL-HPO outperforms the classification results of these approaches by 6.8% within the same time frame, demonstrating the efficiency of TRL-HPO for the HPO process. The analysis of the results identifies the main culprit for performance degradation attributed to stacking fully connected layers. This paper identifies new avenues for improving RL-based HPO processes in resource-constrained environments.
- [1718] arXiv:2403.12242 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reference-based Metrics Disprove Themselves in Question GenerationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
- [1719] arXiv:2403.12297 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning ApproachMaria Mahbub , Gregory M. Dams , Sudarshan Srinivasan , Caitlin Rizy , Ioana Danciu , Jodie Trafton , Kathryn KnightComments: 10 pages, 4 figures, 2 tablesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Substance use disorder (SUD) poses a major concern due to its detrimental effects on health and society. SUD identification and treatment depend on a variety of factors such as severity, co-determinants (e.g., withdrawal symptoms), and social determinants of health. Existing diagnostic coding systems used by American insurance providers, like the International Classification of Diseases (ICD-10), lack granularity for certain diagnoses, but clinicians will add this granularity (as that found within the Diagnostic and Statistical Manual of Mental Disorders classification or DSM-5) as supplemental unstructured text in clinical notes. Traditional natural language processing (NLP) methods face limitations in accurately parsing such diverse clinical language. Large Language Models (LLMs) offer promise in overcoming these challenges by adapting to diverse language patterns. This study investigates the application of LLMs for extracting severity-related information for various SUD diagnoses from clinical notes. We propose a workflow employing zero-shot learning of LLMs with carefully crafted prompts and post-processing techniques. Through experimentation with Flan-T5, an open-source LLM, we demonstrate its superior recall compared to the rule-based approach. Focusing on 11 categories of SUD diagnoses, we show the effectiveness of LLMs in extracting severity information, contributing to improved risk assessment and treatment planning for SUD patients.
- [1720] arXiv:2403.12307 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Molecular Classification Using Hyperdimensional Graph ClassificationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
Abstract: Our work introduces an innovative approach to graph learning by leveraging Hyperdimensional Computing. Graphs serve as a widely embraced method for conveying information, and their utilization in learning has gained significant attention. This is notable in the field of chemoinformatics, where learning from graph representations plays a pivotal role. An important application within this domain involves the identification of cancerous cells across diverse molecular structures.
We propose an HDC-based model that demonstrates comparable Area Under the Curve results when compared to state-of-the-art models like Graph Neural Networks (GNNs) or the Weisfieler-Lehman graph kernel (WL). Moreover, it outperforms previously proposed hyperdimensional computing graph learning methods. Furthermore, it achieves noteworthy speed enhancements, boasting a 40x acceleration in the training phase and a 15x improvement in inference time compared to GNN and WL models. This not only underscores the efficacy of the HDC-based method, but also highlights its potential for expedited and resource-efficient graph learning. - [1721] arXiv:2403.12309 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning from Delayed Observations via World ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In standard Reinforcement Learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of RL algorithms. In this paper, we focus on addressing observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or even degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to %30. Moreover, we evaluate our methods on visual input based delayed environment, for the first time showcasing delay-aware reinforcement learning on visual observations.
- [1722] arXiv:2403.12320 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network TrainingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Efficient and biologically plausible alternatives to backpropagation in neural network training remain a challenge due to issues such as high computational complexity and additional assumptions about neural networks, which limit scalability to deeper networks. The likelihood ratio method offers a promising gradient estimation strategy but is constrained by significant memory consumption, especially when deploying multiple copies of data to reduce estimation variance. In this paper, we introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation. By exploiting the natural parallelism during the backward pass using LR, we further provide a high-performance training strategy, which pipelines both the forward and backward pass, to make it more suitable for the computation on specialized hardware. Extensive experiments demonstrate the effectiveness of the approximation technique in neural network training. This work underscores the potential of the likelihood ratio method in achieving high-performance neural network training, suggesting avenues for further exploration.
- [1723] arXiv:2403.12368 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Characteristic AI Agents via Large Language ModelsComments: COLING 2024,The benchmark is available at: this https URLSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The advancement of Large Language Models (LLMs) has led to significant enhancements in the performance of chatbot systems. Many researchers have dedicated their efforts to the development of bringing characteristics to chatbots. While there have been commercial products for developing role-driven chatbots using LLMs, it is worth noting that academic research in this area remains relatively scarce. Our research focuses on investigating the performance of LLMs in constructing Characteristic AI Agents by simulating real-life individuals across different settings. Current investigations have primarily focused on act on roles with simple profiles. In response to this research gap, we create a benchmark for the characteristic AI agents task, including dataset, techniques, and evaluation metrics. A dataset called ``Character100'' is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play. With the constructed dataset, we conduct comprehensive assessment of LLMs across various settings. In addition, we devise a set of automatic metrics for quantitative performance evaluation. The experimental results underscore the potential directions for further improvement in the capabilities of LLMs in constructing characteristic AI agents. The benchmark is available at this https URL .
- [1724] arXiv:2403.12386 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Pipelined Biomedical Event Extraction Rivaling Joint LearningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Biomedical event extraction is an information extraction task to obtain events from biomedical text, whose targets include the type, the trigger, and the respective arguments involved in an event. Traditional biomedical event extraction usually adopts a pipelined approach, which contains trigger identification, argument role recognition, and finally event construction either using specific rules or by machine learning. In this paper, we propose an n-ary relation extraction method based on the BERT pre-training model to construct Binding events, in order to capture the semantic information about an event's context and its participants. The experimental results show that our method achieves promising results on the GE11 and GE13 corpora of the BioNLP shared task with F1 scores of 63.14% and 59.40%, respectively. It demonstrates that by significantly improving theperformance of Binding events, the overall performance of the pipelined event extraction approach or even exceeds those of current joint learning methods.
- [1725] arXiv:2403.12388 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Interpretable User Satisfaction Estimation for Conversational Systems with Large Language ModelsYing-Chun Lin , Jennifer Neville , Jack W. Stokes , Longqi Yang , Tara Safavi , Mengting Wan , Scott Counts , Siddharth Suri , Reid Andersen , Xiaofeng Xu , Deepak Gupta , Sujay Kumar Jauhar , Xia Song , Georg Buscher , Saurabh Tiwary , Brent Hecht , Jaime TeevanSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.
- [1726] arXiv:2403.12391 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FairSTG: Countering performance heterogeneity via collaborative sample-level optimizationComments: Under review by IEEE Transactions on Mobile ComputingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Spatiotemporal learning plays a crucial role in mobile computing techniques to empower smart cites. While existing research has made great efforts to achieve accurate predictions on the overall dataset, they still neglect the significant performance heterogeneity across samples. In this work, we designate the performance heterogeneity as the reason for unfair spatiotemporal learning, which not only degrades the practical functions of models, but also brings serious potential risks to real-world urban applications. To fix this gap, we propose a model-independent Fairness-aware framework for SpatioTemporal Graph learning (FairSTG), which inherits the idea of exploiting advantages of well-learned samples to challenging ones with collaborative mix-up. Specifically, FairSTG consists of a spatiotemporal feature extractor for model initialization, a collaborative representation enhancement for knowledge transfer between well-learned samples and challenging ones, and fairness objectives for immediately suppressing sample-level performance heterogeneity. Experiments on four spatiotemporal datasets demonstrate that our FairSTG significantly improves the fairness quality while maintaining comparable forecasting accuracy. Case studies show FairSTG can counter both spatial and temporal performance heterogeneity by our sample-level retrieval and compensation, and our work can potentially alleviate the risks on spatiotemporal resource allocation for underrepresented urban regions.
- [1727] arXiv:2403.12392 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AraPoemBERT: A Pretrained Language Model for Arabic Poetry AnalysisComments: 28 pages, 11 figures, not published yetSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Arabic poetry, with its rich linguistic features and profound cultural significance, presents a unique challenge to the Natural Language Processing (NLP) field. The complexity of its structure and context necessitates advanced computational models for accurate analysis. In this paper, we introduce AraPoemBERT, an Arabic language model pretrained exclusively on Arabic poetry text. To demonstrate the effectiveness of the proposed model, we compared AraPoemBERT with 5 different Arabic language models on various NLP tasks related to Arabic poetry. The new model outperformed all other models and achieved state-of-the-art results in most of the downstream tasks. AraPoemBERT achieved unprecedented accuracy in two out of three novel tasks: poet's gender classification (99.34\% accuracy), and poetry sub-meter classification (97.79\% accuracy). In addition, the model achieved an accuracy score in poems' rhyme classification (97.73\% accuracy) which is almost equivalent to the best score reported in this study. Moreover, the proposed model significantly outperformed previous work and other comparative models in the tasks of poems' sentiment analysis, achieving an accuracy of 78.95\%, and poetry meter classification (99.03\% accuracy), while significantly expanding the scope of these two problems. The dataset used in this study, contains more than 2.09 million verses collected from online sources, each associated with various attributes such as meter, sub-meter, poet, rhyme, and topic. The results demonstrate the effectiveness of the proposed model in understanding and analyzing Arabic poetry, achieving state-of-the-art results in several tasks and outperforming previous works and other language models included in the study. AraPoemBERT model is publicly available on \url{ this https URL }.
- [1728] arXiv:2403.12400 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Finding the Missing Data: A BERT-inspired Approach Against Package Loss in Wireless SensingComments: 6 pages, accepted by IEEE INFOCOM Deepwireless Workshop 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: Despite the development of various deep learning methods for Wi-Fi sensing, package loss often results in noncontinuous estimation of the Channel State Information (CSI), which negatively impacts the performance of the learning models. To overcome this challenge, we propose a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) for CSI recovery, named CSI-BERT. CSI-BERT can be trained in an self-supervised manner on the target dataset without the need for additional data. Furthermore, unlike traditional interpolation methods that focus on one subcarrier at a time, CSI-BERT captures the sequential relationships across different subcarriers. Experimental results demonstrate that CSI-BERT achieves lower error rates and faster speed compared to traditional interpolation methods, even when facing with high loss rates. Moreover, by harnessing the recovered CSI obtained from CSI-BERT, other deep learning models like Residual Network and Recurrent Neural Network can achieve an average increase in accuracy of approximately 15\% in Wi-Fi sensing tasks. The collected dataset WiGesture and code for our model are publicly available at this https URL .
- [1729] arXiv:2403.12403 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Towards Interpretable Hate Speech Detection using Large Language Model-extracted RationalesComments: Camera-ready for NAACL WOAH 2024 (Workshop on Online Abuse and Harms). First two authors contributed equallySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity offered by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address the lack of interpretability, in this paper, we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs and the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of English language social media hate speech datasets demonstrate: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability. All code and data will be made available at this https URL .
- [1730] arXiv:2403.12418 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: STG-Mamba: Spatial-Temporal Graph Learning via Selective State Space ModelSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Spatial-Temporal Graph (STG) data is characterized as dynamic, heterogenous, and non-stationary, leading to the continuous challenge of spatial-temporal graph learning. In the past few years, various GNN-based methods have been proposed to solely focus on mimicking the relationships among node individuals of the STG network, ignoring the significance of modeling the intrinsic features that exist in STG system over time. In contrast, modern Selective State Space Models (SSSMs) present a new approach which treat STG Network as a system, and meticulously explore the STG system's dynamic state evolution across temporal dimension. In this work, we introduce Spatial-Temporal Graph Mamba (STG-Mamba) as the first exploration of leveraging the powerful selective state space models for STG learning by treating STG Network as a system, and employing the Graph Selective State Space Block (GS3B) to precisely characterize the dynamic evolution of STG networks. STG-Mamba is formulated as an Encoder-Decoder architecture, which takes GS3B as the basic module, for efficient sequential data modeling. Furthermore, to strengthen GNN's ability of modeling STG data under the setting of SSSMs, we propose Kalman Filtering Graph Neural Networks (KFGN) for adaptive graph structure upgrading. KFGN smoothly fits in the context of selective state space evolution, and at the same time keeps linear complexity. Extensive empirical studies are conducted on three benchmark STG forecasting datasets, demonstrating the performance superiority and computational efficiency of STG-Mamba. It not only surpasses existing state-of-the-art methods in terms of STG forecasting performance, but also effectively alleviate the computational bottleneck of large-scale graph networks in reducing the computational cost of FLOPs and test inference time.
- [1731] arXiv:2403.12431 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Geometric Constraints in Deep Learning Frameworks: A SurveyComments: A preprintSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Stereophotogrammetry is an emerging technique of scene understanding. Its origins go back to at least the 1800s when people first started to investigate using photographs to measure the physical properties of the world. Since then, thousands of approaches have been explored. The classic geometric techniques of Shape from Stereo is built on using geometry to define constraints on scene and camera geometry and then solving the non-linear systems of equations. More recent work has taken an entirely different approach, using end-to-end deep learning without any attempt to explicitly model the geometry. In this survey, we explore the overlap for geometric-based and deep learning-based frameworks. We compare and contrast geometry enforcing constraints integrated into a deep learning framework for depth estimation or other closely related problems. We present a new taxonomy for prevalent geometry enforcing constraints used in modern deep learning frameworks. We also present insightful observations and potential future research directions.
- [1732] arXiv:2403.12448 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Do Generated Data Always Help Contrastive Learning?Comments: 19 pages. Accepted by ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Abstract: Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-equality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that the generated data (even from a good diffusion model like DDPM) may sometimes even harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal the complementary roles that stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena via deriving its generalization bounds under data inflation. Drawing from these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy without introducing any extra computation cost. On benchmark datasets, AdaInf can bring significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at this https URL .
- [1733] arXiv:2403.12459 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Non-negative Contrastive LearningComments: 22 pages. Accepted by ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Abstract: Deep representations have shown promising performance when transferred to downstream tasks in a black-box manner. Yet, their inherent lack of interpretability remains a significant challenge, as these features are often opaque to human understanding. In this paper, we propose Non-negative Contrastive Learning (NCL), a renaissance of Non-negative Matrix Factorization (NMF) aimed at deriving interpretable features. The power of NCL lies in its enforcement of non-negativity constraints on features, reminiscent of NMF's capability to extract features that align closely with sample clusters. NCL not only aligns mathematically well with an NMF objective but also preserves NMF's interpretability attributes, resulting in a more sparse and disentangled representation compared to standard contrastive learning (CL). Theoretically, we establish guarantees on the identifiability and downstream generalization of NCL. Empirically, we show that these advantages enable NCL to outperform CL significantly on feature disentanglement, feature selection, as well as downstream classification tasks. At last, we show that NCL can be easily extended to other learning scenarios and benefit supervised learning as well. Code is available at this https URL .
- [1734] arXiv:2403.12462 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Topological Representations of Heterogeneous Learning Dynamics of Recurrent Spiking Neural NetworksComments: Accepted in IEEE World Congress on Computational Intelligence (IEEE WCCI) 2024Subjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Spiking Neural Networks (SNNs) have become an essential paradigm in neuroscience and artificial intelligence, providing brain-inspired computation. Recent advances in literature have studied the network representations of deep neural networks. However, there has been little work that studies representations learned by SNNs, especially using unsupervised local learning methods like spike-timing dependent plasticity (STDP). Recent work by \cite{barannikov2021representation} has introduced a novel method to compare topological mappings of learned representations called Representation Topology Divergence (RTD). Though useful, this method is engineered particularly for feedforward deep neural networks and cannot be used for recurrent networks like Recurrent SNNs (RSNNs). This paper introduces a novel methodology to use RTD to measure the difference between distributed representations of RSNN models with different learning methods. We propose a novel reformulation of RSNNs using feedforward autoencoder networks with skip connections to help us compute the RTD for recurrent networks. Thus, we investigate the learning capabilities of RSNN trained using STDP and the role of heterogeneity in the synaptic dynamics in learning such representations. We demonstrate that heterogeneous STDP in RSNNs yield distinct representations than their homogeneous and surrogate gradient-based supervised learning counterparts. Our results provide insights into the potential of heterogeneous SNN models, aiding the development of more efficient and biologically plausible hybrid artificial intelligence systems.
- [1735] arXiv:2403.12463 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Reinforcement learning based local path planning for mobile robotComments: 5 Pages, 10 figures, Presented in; Interdisciplinary Conference on Mechanics, Computers and Electrics ANKARA/TURKEY 27-28 November 2021Journal-ref: Interdisciplinary Conference on Mechanics, Computers and Electrics, 27-28 Nov. 2021, AnkaraSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Different methods are used for a mobile robot to go to a specific target location. These methods work in different ways for online and offline scenarios. In the offline scenario, an environment map is created once, and fixed path planning is made on this map to reach the target. Path planning algorithms such as A* and RRT (Rapidly-Exploring Random Tree) are the examples of offline methods. The most obvious situation here is the need to re-plan the path for changing conditions of the loaded map. On the other hand, in the online scenario, the robot moves dynamically to a given target without using a map by using the perceived data coming from the sensors. Approaches such as SFM (Social Force Model) are used in online systems. However, these methods suffer from the requirement of a lot of dynamic sensing data. Thus, it can be said that the need for re-planning and mapping in offline systems and various system design requirements in online systems are the subjects that focus on autonomous mobile robot research. Recently, deep neural network powered Q-Learning methods are used as an emerging solution to the aforementioned problems in mobile robot navigation. In this study, machine learning algorithms with deep Q-Learning (DQN) and Deep DQN architectures, are evaluated for the solution of the problems presented above to realize path planning of an autonomous mobile robot to avoid obstacles.
- [1736] arXiv:2403.12486 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: NTK-Guided Few-Shot Class Incremental LearningSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: While anti-amnesia FSCIL learners often excel in incremental sessions, they tend to prioritize mitigating knowledge attrition over harnessing the model's potential for knowledge acquisition. In this paper, we delve into the foundations of model generalization in FSCIL through the lens of the Neural Tangent Kernel (NTK). Our primary design focus revolves around ensuring optimal NTK convergence and NTK-related generalization error, serving as the theoretical bedrock for exceptional generalization. To attain globally optimal NTK convergence, we employ a meta-learning mechanism grounded in mathematical principles to guide the optimization process within an expanded network. Furthermore, to reduce the NTK-related generalization error, we commence from the foundational level, optimizing the relevant factors constituting its generalization loss. Specifically, we initiate self-supervised pre-training on the base session to shape the initial network weights. Then they are carefully refined through curricular alignment, followed by the application of dual NTK regularization tailored specifically for both convolutional and linear layers. Through the combined effects of these measures, our network acquires robust NTK properties, significantly enhancing its foundational generalization. On popular FSCIL benchmark datasets, our NTK-FSCIL surpasses contemporary state-of-the-art approaches, elevating end-session accuracy by 2.9% to 8.7%.
- [1737] arXiv:2403.12488 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLMYixuan Wu , Yizhou Wang , Shixiang Tang , Wenhao Wu , Tong He , Wanli Ouyang , Jian Wu , Philip TorrSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measure standards (e.g., overlaying rulers and compasses), and infer from the contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan for progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on MS COCO Novel class set for open-vocabulary detection, +24.23% Acc on RefCOCO val set for zero-shot referring expression comprehension, +14.5% AP on D-cube describe object detection FULL setting.
- [1738] arXiv:2403.12503 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Securing Large Language Models: Threats, Vulnerabilities and Responsible PracticesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Large language models (LLMs) have significantly transformed the landscape of Natural Language Processing (NLP). Their impact extends across a diverse spectrum of tasks, revolutionizing how we approach language understanding and generations. Nevertheless, alongside their remarkable utility, LLMs introduce critical security and risk considerations. These challenges warrant careful examination to ensure responsible deployment and safeguard against potential vulnerabilities. This research paper thoroughly investigates security and privacy concerns related to LLMs from five thematic perspectives: security and privacy concerns, vulnerabilities against adversarial attacks, potential harms caused by misuses of LLMs, mitigation strategies to address these challenges while identifying limitations of current strategies. Lastly, the paper recommends promising avenues for future research to enhance the security and risk management of LLMs.
- [1739] arXiv:2403.12510 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Generalized Consistency Trajectory Models for Image ManipulationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Diffusion-based generative models excel in unconditional generation, as well as on applied tasks such as image editing and restoration. The success of diffusion models lies in the iterative nature of diffusion: diffusion breaks down the complex process of mapping noise to data into a sequence of simple denoising tasks. Moreover, we are able to exert fine-grained control over the generation process by injecting guidance terms into each denoising step. However, the iterative process is also computationally intensive, often taking from tens up to thousands of function evaluations. Although consistency trajectory models (CTMs) enable traversal between any time points along the probability flow ODE (PFODE) and score inference with a single function evaluation, CTMs only allow translation from Gaussian noise to data. Thus, this work aims to unlock the full potential of CTMs by proposing generalized CTMs (GCTMs), which translate between arbitrary distributions via ODEs. We discuss the design space of GCTMs and demonstrate their efficacy in various image manipulation tasks such as image-to-image translation, restoration, and editing. Code: \url{ this https URL }
- [1740] arXiv:2403.12523 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: GraphERE: Jointly Multiple Event-Event Relation Extraction via Graph-Enhanced Event EmbeddingsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Events describe the state changes of entities. In a document, multiple events are connected by various relations (e.g., Coreference, Temporal, Causal, and Subevent). Therefore, obtaining the connections between events through Event-Event Relation Extraction (ERE) is critical to understand natural language. There are two main problems in the current ERE works: a. Only embeddings of the event triggers are used for event feature representation, ignoring event arguments (e.g., time, place, person, etc.) and their structure within the event. b. The interconnection between relations (e.g., temporal and causal relations usually interact with each other ) is ignored. To solve the above problems, this paper proposes a jointly multiple ERE framework called GraphERE based on Graph-enhanced Event Embeddings. First, we enrich the event embeddings with event argument and structure features by using static AMR graphs and IE graphs; Then, to jointly extract multiple event relations, we use Node Transformer and construct Task-specific Dynamic Event Graphs for each type of relation. Finally, we used a multi-task learning strategy to train the whole framework. Experimental results on the latest MAVEN-ERE dataset validate that GraphERE significantly outperforms existing methods. Further analyses indicate the effectiveness of the graph-enhanced event embeddings and the joint extraction strategy.
- [1741] arXiv:2403.12533 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: To Help or Not to Help: LLM-based Attentive Support for Human-Robot Group InteractionsDaniel Tanneberg , Felix Ocker , Stephan Hasler , Joerg Deigmoeller , Anna Belardinelli , Chao Wang , Heiko Wersing , Bernhard Sendhoff , Michael GiengerComments: 8 pages, 5 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: How can a robot provide unobtrusive physical support within a group of humans? We present Attentive Support, a novel interaction concept for robots to support a group of humans. It combines scene perception, dialogue acquisition, situation understanding, and behavior generation with the common-sense reasoning capabilities of Large Language Models (LLMs). In addition to following user instructions, Attentive Support is capable of deciding when and how to support the humans, and when to remain silent to not disturb the group. With a diverse set of scenarios, we show and evaluate the robot's attentive behavior, which supports and helps the humans when required, while not disturbing if no help is needed.
- [1742] arXiv:2403.12552 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous DrivingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract: End-to-end autonomous driving has witnessed remarkable progress. However, the extensive deployment of autonomous vehicles has yet to be realized, primarily due to 1) inefficient multi-modal environment perception: how to integrate data from multi-modal sensors more efficiently; 2) non-human-like scene understanding: how to effectively locate and predict critical risky agents in traffic scenarios like an experienced driver. To overcome these challenges, in this paper, we propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving. To better fuse multi-modal data and achieve higher alignment between different modalities, a novel Lidar-Vision-Attention-based Fusion (LVAFusion) module is proposed. By incorporating driver attention, we empower the human-like scene understanding ability to autonomous vehicles to identify crucial areas within complex scenarios precisely and ensure safety. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance with less data in closed-loop benchmarks. Source codes are available at https://anonymous.4open.science/r/M2DA-4772.
- [1743] arXiv:2403.12562 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Equity through Access: A Case for Small-scale Deep LearningComments: Source code available at this https URLSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: The recent advances in deep learning (DL) have been accelerated by access to large-scale data and compute. These large-scale resources have been used to train progressively larger models which are resource intensive in terms of compute, data, energy, and carbon emissions. These costs are becoming a new type of entry barrier to researchers and practitioners with limited access to resources at such scale, particularly in the Global South. In this work, we take a comprehensive look at the landscape of existing DL models for vision tasks and demonstrate their usefulness in settings where resources are limited. To account for the resource consumption of DL models, we introduce a novel measure to estimate the performance per resource unit, which we call the PePR score. Using a diverse family of 131 unique DL architectures (spanning 1M to 130M trainable parameters) and three medical image datasets, we capture trends about the performance-resource trade-offs. In applications like medical image analysis, we argue that small-scale, specialized models are better than striving for large-scale models. Furthermore, we show that using pretrained models can significantly reduce the computational resources and data required. We hope this work will encourage the community to focus on improving AI equity by developing methods and models with smaller resource footprints.
- [1744] arXiv:2403.12563 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Simple Hack for Transformers against Heavy Long-Text Classification on a Time- and Memory-Limited GPU ServiceComments: The 10th International Conference on Advanced Informatics: Concepts, Theory, and Applications (ICAICTA 2023)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Many NLP researchers rely on free computational services, such as Google Colab, to fine-tune their Transformer models, causing a limitation for hyperparameter optimization (HPO) in long-text classification due to the method having quadratic complexity and needing a bigger resource. In Indonesian, only a few works were found on long-text classification using Transformers. Most only use a small amount of data and do not report any HPO. In this study, using 18k news articles, we investigate which pretrained models are recommended to use based on the output length of the tokenizer. We then compare some hacks to shorten and enrich the sequences, which are the removals of stopwords, punctuation, low-frequency words, and recurring words. To get a fair comparison, we propose and run an efficient and dynamic HPO procedure that can be done gradually on a limited resource and does not require a long-running optimization library. Using the best hack found, we then compare 512, 256, and 128 tokens length. We find that removing stopwords while keeping punctuation and low-frequency words is the best hack. Some of our setups manage to outperform taking 512 first tokens using a smaller 128 or 256 first tokens which manage to represent the same information while requiring less computational resources. The findings could help developers to efficiently pursue optimal performance of the models using limited resources.
- [1745] arXiv:2403.12568 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Memory-Efficient and Secure DNN Inference on TrustZone-enabled Consumer IoT DevicesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Edge intelligence enables resource-demanding Deep Neural Network (DNN) inference without transferring original data, addressing concerns about data privacy in consumer Internet of Things (IoT) devices. For privacy-sensitive applications, deploying models in hardware-isolated trusted execution environments (TEEs) becomes essential. However, the limited secure memory in TEEs poses challenges for deploying DNN inference, and alternative techniques like model partitioning and offloading introduce performance degradation and security issues. In this paper, we present a novel approach for advanced model deployment in TrustZone that ensures comprehensive privacy preservation during model inference. We design a memory-efficient management method to support memory-demanding inference in TEEs. By adjusting the memory priority, we effectively mitigate memory leakage risks and memory overlap conflicts, resulting in 32 lines of code alterations in the trusted operating system. Additionally, we leverage two tiny libraries: S-Tinylib (2,538 LoCs), a tiny deep learning library, and Tinylibm (827 LoCs), a tiny math library, to support efficient inference in TEEs. We implemented a prototype on Raspberry Pi 3B+ and evaluated it using three well-known lightweight DNN models. The experimental results demonstrate that our design significantly improves inference speed by 3.13 times and reduces power consumption by over 66.5% compared to non-memory optimization method in TEEs.
- [1746] arXiv:2403.12572 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Compound Expression Recognition via Multi Model EnsembleSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Compound Expression Recognition (CER) plays a crucial role in interpersonal interactions. Due to the existence of Compound Expressions , human emotional expressions are complex, requiring consideration of both local and global facial expressions to make judgments. In this paper, to address this issue, we propose a solution based on ensemble learning methods for Compound Expression Recognition. Specifically, our task is classification, where we train three expression classification models based on convolutional networks, Vision Transformers, and multi-scale local attention networks. Then, through model ensemble using late fusion, we merge the outputs of multiple models to predict the final result. Our method achieves high accuracy on RAF-DB and is able to recognize expressions through zero-shot on certain portions of C-EXPR-DB.
- [1747] arXiv:2403.12574 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural NetworksSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract: Event cameras, with their high dynamic range and temporal resolution, are ideally suited for object detection, especially under scenarios with motion blur and challenging lighting conditions. However, while most existing approaches prioritize optimizing spatiotemporal representations with advanced detection backbones and early aggregation functions, the crucial issue of adaptive event sampling remains largely unaddressed. Spiking Neural Networks (SNNs), which operate on an event-driven paradigm through sparse spike communication, emerge as a natural fit for addressing this challenge. In this study, we discover that the neural dynamics of spiking neurons align closely with the behavior of an ideal temporal event sampler. Motivated by this insight, we propose a novel adaptive sampling module that leverages recurrent convolutional SNNs enhanced with temporal memory, facilitating a fully end-to-end learnable framework for event-based detection. Additionally, we introduce Residual Potential Dropout (RPD) and Spike-Aware Training (SAT) to regulate potential distribution and address performance degradation encountered in spike-based sampling modules. Through rigorous testing on neuromorphic datasets for event-based detection, our approach demonstrably surpasses existing state-of-the-art spike-based methods, achieving superior performance with significantly fewer parameters and time steps. For instance, our method achieves a 4.4\% mAP improvement on the Gen1 dataset, while requiring 38\% fewer parameters and three time steps. Moreover, the applicability and effectiveness of our adaptive sampling methodology extend beyond SNNs, as demonstrated through further validation on conventional non-spiking detection models.
- [1748] arXiv:2403.12588 (cross-list from cs.IT) [ pdf , ps , html , other ]
-
Title: Machine Learning of the Prime DistributionComments: 10 pages; parts of arXiv:2308.10817 reworked and amended; author's draft; accepted in PLOS ONESubjects: Information Theory (cs.IT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Number Theory (math.NT)
Abstract: In the present work we use maximum entropy methods to derive several theorems in probabilistic number theory, including a version of the Hardy-Ramanujan Theorem. We also provide a theoretical argument explaining the experimental observations of Y.-H. He about the learnability of primes, and posit that the Erdős-Kac law would very unlikely be discovered by current machine learning techniques. Numerical experiments that we perform corroborate our theoretical findings.
- [1749] arXiv:2403.12589 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: FootstepNet: an Efficient Actor-Critic Method for Fast On-line Bipedal Footstep Planning and ForecastingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Designing a humanoid locomotion controller is challenging and classically split up in sub-problems. Footstep planning is one of those, where the sequence of footsteps is defined. Even in simpler environments, finding a minimal sequence, or even a feasible sequence, yields a complex optimization problem. In the literature, this problem is usually addressed by search-based algorithms (e.g. variants of A*). However, such approaches are either computationally expensive or rely on hand-crafted tuning of several parameters. In this work, at first, we propose an efficient footstep planning method to navigate in local environments with obstacles, based on state-of-the art Deep Reinforcement Learning (DRL) techniques, with very low computational requirements for on-line inference. Our approach is heuristic-free and relies on a continuous set of actions to generate feasible footsteps. In contrast, other methods necessitate the selection of a relevant discrete set of actions. Second, we propose a forecasting method, allowing to quickly estimate the number of footsteps required to reach different candidates of local targets. This approach relies on inherent computations made by the actor-critic DRL architecture. We demonstrate the validity of our approach with simulation results, and by a deployment on a kid-size humanoid robot during the RoboCup 2023 competition.
- [1750] arXiv:2403.12631 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: PointGrasp: Point Cloud-based Grasping for Tendon-driven Soft Robotic Glove ApplicationsComments: 6 pages, 8 figures, conferenceSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Controlling hand exoskeletons to assist individuals with grasping tasks poses a challenge due to the difficulty in understanding user intentions. We propose that most daily grasping tasks during activities of daily living (ADL) can be deduced by analyzing object geometries (simple and complex) from 3D point clouds. The study introduces PointGrasp, a real-time system designed for identifying household scenes semantically, aiming to support and enhance assistance during ADL for tailored end-to-end grasping tasks. The system comprises an RGB-D camera with an inertial measurement unit and a microprocessor integrated into a tendon-driven soft robotic glove. The RGB-D camera processes 3D scenes at a rate exceeding 30 frames per second. The proposed pipeline demonstrates an average RMSE of 0.8 $\pm$ 0.39 cm for simple and 0.11 $\pm$ 0.06 cm for complex geometries. Within each mode, it identifies and pinpoints reachable objects. This system shows promise in end-to-end vision-driven robotic-assisted rehabilitation manual tasks.
- [1751] arXiv:2403.12649 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: InBox: Recommendation with Knowledge Graph using Interest Box EmbeddingComments: VLDB 2024 under submissionSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Knowledge graphs (KGs) have become vitally important in modern recommender systems, effectively improving performance and interpretability. Fundamentally, recommender systems aim to identify user interests based on historical interactions and recommend suitable items. However, existing works overlook two key challenges: (1) an interest corresponds to a potentially large set of related items, and (2) the lack of explicit, fine-grained exploitation of KG information and interest connectivity. This leads to an inability to reflect distinctions between entities and interests when modeling them in a single way. Additionally, the granularity of concepts in the knowledge graphs used for recommendations tends to be coarse, failing to match the fine-grained nature of user interests. This homogenization limits the precise exploitation of knowledge graph data and interest connectivity. To address these limitations, we introduce a novel embedding-based model called InBox. Specifically, various knowledge graph entities and relations are embedded as points or boxes, while user interests are modeled as boxes encompassing interaction history. Representing interests as boxes enables containing collections of item points related to that interest. We further propose that an interest comprises diverse basic concepts, and box intersection naturally supports concept combination. Across three training steps, InBox significantly outperforms state-of-the-art methods like HAKG and KGIN on recommendation tasks. Further analysis provides meaningful insights into the variable value of different KG data for recommendations. In summary, InBox advances recommender systems through box-based interest and concept modeling for sophisticated knowledge graph exploitation.
- [1752] arXiv:2403.12660 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: ERASE: Benchmarking Feature Selection Methods for Deep Recommender SystemsPengyue Jia , Yejing Wang , Zhaocheng Du , Xiangyu Zhao , Yichao Wang , Bo Chen , Wanyu Wang , Huifeng Guo , Ruiming TangSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction.
- [1753] arXiv:2403.12664 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deciphering AutoML Ensembles: cattleia's Assistance in Decision-MakingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: In many applications, model ensembling proves to be better than a single predictive model. Hence, it is the most common post-processing technique in Automated Machine Learning (AutoML). The most popular frameworks use ensembles at the expense of reducing the interpretability of the final models. In our work, we propose cattleia - an application that deciphers the ensembles for regression, multiclass, and binary classification tasks. This tool works with models built by three AutoML packages: auto-sklearn, AutoGluon, and FLAML. The given ensemble is analyzed from different perspectives. We conduct a predictive performance investigation through evaluation metrics of the ensemble and its component models. We extend the validation perspective by introducing new measures to assess the diversity and complementarity of the model predictions. Moreover, we apply explainable artificial intelligence (XAI) techniques to examine the importance of variables. Summarizing obtained insights, we can investigate and adjust the weights with a modification tool to tune the ensemble in the desired way. The application provides the aforementioned aspects through dedicated interactive visualizations, making it accessible to a diverse audience. We believe the cattleia can support users in decision-making and deepen the comprehension of AutoML frameworks.
- [1754] arXiv:2403.12671 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Enhancing Security of AI-Based Code Synthesis with GitHub Copilot via Cheap and Efficient Prompt-EngineeringSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: AI assistants for coding are on the rise. However one of the reasons developers and companies avoid harnessing their full potential is the questionable security of the generated code. This paper first reviews the current state-of-the-art and identifies areas for improvement on this issue. Then, we propose a systematic approach based on prompt-altering methods to achieve better code security of (even proprietary black-box) AI-based code generators such as GitHub Copilot, while minimizing the complexity of the application from the user point-of-view, the computational resources, and operational costs. In sum, we propose and evaluate three prompt altering methods: (1) scenario-specific, (2) iterative, and (3) general clause, while we discuss their combination. Contrary to the audit of code security, the latter two of the proposed methods require no expert knowledge from the user. We assess the effectiveness of the proposed methods on the GitHub Copilot using the OpenVPN project in realistic scenarios, and we demonstrate that the proposed methods reduce the number of insecure generated code samples by up to 16\% and increase the number of secure code by up to 8\%. Since our approach does not require access to the internals of the AI models, it can be in general applied to any AI-based code synthesizer, not only GitHub Copilot.
- [1755] arXiv:2403.12672 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Improving Interpretability of Scores in Anomaly Detection Based on Gaussian-Bernoulli Restricted Boltzmann MachineSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Gaussian-Bernoulli restricted Boltzmann machines (GBRBMs) are often used for semi-supervised anomaly detection, where they are trained using only normal data points. In GBRBM-based anomaly detection, normal and anomalous data are classified based on a score that is identical to an energy function of the marginal GBRBM. However, the classification threshold is difficult to set to an appropriate value, as this score cannot be interpreted. In this study, we propose a measure that improves score's interpretability based on its cumulative distribution, and establish a guideline for setting the threshold using the interpretable measure. The results of numerical experiments show that the guideline is reasonable when setting the threshold solely using normal data points. Moreover, because identifying the measure involves computationally infeasible evaluation of the minimum score value, we also propose an evaluation method for the minimum score based on simulated annealing, which is widely used for optimization problems. The proposed evaluation method was also validated using numerical experiments.
- [1756] arXiv:2403.12678 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Empowering Air Travelers: A Chatbot for Canadian Air Passenger RightsComments: under reviewSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regulations. The most relevant passages from these documents are presented along with links to the original documents and the generated queries, enabling users to dissect and leverage the information for their unique circumstances. The system successfully overcomes two predominant challenges: understanding complex user inputs, and delivering accurate answers, free of hallucinations, that passengers can rely on for making informed decisions. A user study comparing the chatbot to a Google search demonstrated the chatbot's usefulness and ease of use. Beyond the primary goal of providing accurate and timely information to air passengers regarding their rights, we hope that this system will also enable further research exploring the tradeoff between the user-friendly conversational interface of chatbots and the accuracy of retrieval systems.
- [1757] arXiv:2403.12706 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AnimateDiff-Lightning: Cross-Model Diffusion DistillationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: We present AnimateDiff-Lightning for lightning-fast video generation. Our model uses progressive adversarial diffusion distillation to achieve new state-of-the-art in few-step video generation. We discuss our modifications to adapt it for the video modality. Furthermore, we propose to simultaneously distill the probability flow of multiple base diffusion models, resulting in a single distilled motion module with broader style compatibility. We are pleased to release our distilled AnimateDiff-Lightning model for the community's use.
- [1758] arXiv:2403.12723 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Python Fuzzing for Trustworthy Machine Learning FrameworksJournal-ref: Zapiski Nauchnykh Seminarov Sankt-Peterburgskogo Otdeleniya Matematicheskogo Instituta im. V. A. Steklova Rossiiskoi Akademii Nauk 530 (2023) 38-50Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: Ensuring the security and reliability of machine learning frameworks is crucial for building trustworthy AI-based systems. Fuzzing, a popular technique in secure software development lifecycle (SSDLC), can be used to develop secure and robust software. Popular machine learning frameworks such as PyTorch and TensorFlow are complex and written in multiple programming languages including C/C++ and Python. We propose a dynamic analysis pipeline for Python projects using the Sydr-Fuzz toolset. Our pipeline includes fuzzing, corpus minimization, crash triaging, and coverage collection. Crash triaging and severity estimation are important steps to ensure that the most critical vulnerabilities are addressed promptly. Furthermore, the proposed pipeline is integrated in GitLab CI. To identify the most vulnerable parts of the machine learning frameworks, we analyze their potential attack surfaces and develop fuzz targets for PyTorch, TensorFlow, and related projects such as h5py. Applying our dynamic analysis pipeline to these targets, we were able to discover 3 new bugs and propose fixes for them.
- [1759] arXiv:2403.12730 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: What Does Evaluation of Explainable Artificial Intelligence Actually Tell Us? A Case for Compositional and Contextual Validation of XAI Building BlocksComments: Published in Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24)Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Despite significant progress, evaluation of explainable artificial intelligence remains elusive and challenging. In this paper we propose a fine-grained validation framework that is not overly reliant on any one facet of these sociotechnical systems, and that recognises their inherent modular structure: technical building blocks, user-facing explanatory artefacts and social communication protocols. While we concur that user studies are invaluable in assessing the quality and effectiveness of explanation presentation and delivery strategies from the explainees' perspective in a particular deployment context, the underlying explanation generation mechanisms require a separate, predominantly algorithmic validation strategy that accounts for the technical and human-centred desiderata of their (numerical) outputs. Such a comprehensive sociotechnical utility-based evaluation framework could allow to systematically reason about the properties and downstream influence of different building blocks from which explainable artificial intelligence systems are composed -- accounting for a diverse range of their engineering and social aspects -- in view of the anticipated use case.
- [1760] arXiv:2403.12748 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Building Brain Tumor Segmentation Networks with User-Assisted Filter Estimation and SelectionComments: 10 pages, 5 figures, 2 tables, 24 references, manuscript of conference paperSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Brain tumor image segmentation is a challenging research topic in which deep-learning models have presented the best results. However, the traditional way of training those models from many pre-annotated images leaves several unanswered questions. Hence methodologies, such as Feature Learning from Image Markers (FLIM), have involved an expert in the learning loop to reduce human effort in data annotation and build models sufficiently deep for a given problem. FLIM has been successfully used to create encoders, estimating the filters of all convolutional layers from patches centered at marker voxels. In this work, we present Multi-Step (MS) FLIM - a user-assisted approach to estimating and selecting the most relevant filters from multiple FLIM executions. MS-FLIM is used only for the first convolutional layer, and the results already indicate improvement over FLIM. For evaluation, we build a simple U-shaped encoder-decoder network, named sU-Net, for glioblastoma segmentation using T1Gd and FLAIR MRI scans, varying the encoder's training method, using FLIM, MS-FLIM, and backpropagation algorithm. Also, we compared these sU-Nets with two State-Of-The-Art (SOTA) deep-learning models using two datasets. The results show that the sU-Net based on MS-FLIM outperforms the other training methods and achieves effectiveness within the standard deviations of the SOTA models.
- [1761] arXiv:2403.12777 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Discover and Mitigate Multiple Biased Subgroups in Image ClassifiersComments: CVPR 2024. Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Machine learning models can perform well on in-distribution data but often fail on biased subgroups that are underrepresented in the training data, hindering the robustness of models for reliable applications. Such subgroups are typically unknown due to the absence of subgroup labels. Discovering biased subgroups is the key to understanding models' failure modes and further improving models' robustness. Most previous works of subgroup discovery make an implicit assumption that models only underperform on a single biased subgroup, which does not hold on in-the-wild data where multiple biased subgroups exist.
In this work, we propose Decomposition, Interpretation, and Mitigation (DIM), a novel method to address a more challenging but also more practical problem of discovering multiple biased subgroups in image classifiers. Our approach decomposes the image features into multiple components that represent multiple subgroups. This decomposition is achieved via a bilinear dimension reduction method, Partial Least Square (PLS), guided by useful supervision from the image classifier. We further interpret the semantic meaning of each subgroup component by generating natural language descriptions using vision-language foundation models. Finally, DIM mitigates multiple biased subgroups simultaneously via two strategies, including the data- and model-centric strategies. Extensive experiments on CIFAR-100 and Breeds datasets demonstrate the effectiveness of DIM in discovering and mitigating multiple biased subgroups. Furthermore, DIM uncovers the failure modes of the classifier on Hard ImageNet, showcasing its broader applicability to understanding model bias in image classifiers. The code is available at this https URL . - [1762] arXiv:2403.12799 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Investigating Text Shortening Strategy in BERT: Truncation vs SummarizationComments: The 13th International Conference on Advanced Computer Science and Information Systems (ICACSIS 2021)Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The parallelism of Transformer-based models comes at the cost of their input max-length. Some studies proposed methods to overcome this limitation, but none of them reported the effectiveness of summarization as an alternative. In this study, we investigate the performance of document truncation and summarization in text classification tasks. Each of the two was investigated with several variations. This study also investigated how close their performances are to the performance of full-text. We used a dataset of summarization tasks based on Indonesian news articles (IndoSum) to do classification tests. This study shows how the summaries outperform the majority of truncation method variations and lose to only one. The best strategy obtained in this study is taking the head of the document. The second is extractive summarization. This study explains what happened to the result, leading to further research in order to exploit the potential of document summarization as a shortening alternative. The code and data used in this work are publicly available in this https URL .
- [1763] arXiv:2403.12809 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language ModelsComments: Accepted at NAACL 2024 Main ConferenceSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In many real natural language processing application scenarios, practitioners not only aim to maximize predictive performance but also seek faithful explanations for the model predictions. Rationales and importance distribution given by feature attribution methods (FAs) provide insights into how different parts of the input contribute to a prediction. Previous studies have explored how different factors affect faithfulness, mainly in the context of monolingual English models. On the other hand, the differences in FA faithfulness between multilingual and monolingual models have yet to be explored. Our extensive experiments, covering five languages and five popular FAs, show that FA faithfulness varies between multilingual and monolingual models. We find that the larger the multilingual model, the less faithful the FAs are compared to its counterpart monolingual models.Our further analysis shows that the faithfulness disparity is potentially driven by the differences between model tokenizers. Our code is available: this https URL .
- [1764] arXiv:2403.12816 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Re-identification from histopathology imagesComments: 20 pages, 7 figures, 2 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In numerous studies, deep learning algorithms have proven their potential for the analysis of histopathology images, for example, for revealing the subtypes of tumors or the primary origin of metastases. These models require large datasets for training, which must be anonymized to prevent possible patient identity leaks. This study demonstrates that even relatively simple deep learning algorithms can re-identify patients in large histopathology datasets with substantial accuracy. We evaluated our algorithms on two TCIA datasets including lung squamous cell carcinoma (LSCC) and lung adenocarcinoma (LUAD). We also demonstrate the algorithm's performance on an in-house dataset of meningioma tissue. We predicted the source patient of a slide with F1 scores of 50.16 % and 52.30 % on the LSCC and LUAD datasets, respectively, and with 62.31 % on our meningioma dataset. Based on our findings, we formulated a risk assessment scheme to estimate the risk to the patient's privacy prior to publication.
- [1765] arXiv:2403.12821 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph TransformerComments: CVPR 2024 Camera-ReadySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution. Thus, considerable efforts have been made to quickly and accurately estimate the performances of neural architectures, without full training or evaluation, for given tasks and datasets. Neural architecture encoding has played a crucial role in the estimation, and graphbased methods, which treat an architecture as a graph, have shown prominent performance. For enhanced representation learning of neural architectures, we introduce FlowerFormer, a powerful graph transformer that incorporates the information flows within a neural architecture. FlowerFormer consists of two key components: (a) bidirectional asynchronous message passing, inspired by the flows; (b) global attention built on flow-based masking. Our extensive experiments demonstrate the superiority of FlowerFormer over existing neural encoding methods, and its effectiveness extends beyond computer vision models to include graph neural networks and auto speech recognition models. Our code is available at this http URL .
- [1766] arXiv:2403.12823 (cross-list from cs.LO) [ pdf , ps , other ]
-
Title: Answer Set Programming for Flexible Payroll ManagementComments: Under consideration in Theory and Practice of Logic Programming (TPLP)Subjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI)
Abstract: Payroll management is a critical business task that is subject to a large number of rules, which vary widely between companies, sectors, and countries. Moreover, the rules are often complex and change regularly. Therefore, payroll management systems must be flexible in design. In this paper, we suggest an approach based on a flexible Answer Set Programming (ASP) model and an easy-to-read tabular representation based on the Decision Model and Notation (DMN) standard. It allows HR consultants to represent complex rules without the need for a software engineer, and to ultimately design payroll systems for a variety of different scenarios. We show how the multi-shot solving capabilities of the clingo ASP system can be used to reach the performance that is necessary to handle real-world instances.
- [1767] arXiv:2403.12853 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: RASP: A Drone-based Reconfigurable Actuation and Sensing Platform Towards Ambient Intelligent SystemsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Realizing consumer-grade drones that are as useful as robot vacuums throughout our homes or personal smartphones in our daily lives requires drones to sense, actuate, and respond to general scenarios that may arise. Towards this vision, we propose RASP, a modular and reconfigurable sensing and actuation platform that allows drones to autonomously swap onboard sensors and actuators in only 25 seconds, allowing a single drone to quickly adapt to a diverse range of tasks. RASP consists of a mechanical layer to physically swap sensor modules, an electrical layer to maintain power and communication lines to the sensor/actuator, and a software layer to maintain a common interface between the drone and any sensor module in our platform. Leveraging recent advances in large language and visual language models, we further introduce the architecture, implementation, and real-world deployments of a personal assistant system utilizing RASP. We demonstrate that RASP can enable a diverse range of useful tasks in home, office, lab, and other indoor settings.
- [1768] arXiv:2403.12891 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Adaptive Visual Imitation Learning for Robotic Assisted Feeding Across Varied Bowl Configurations and Food TypesSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: In this study, we introduce a novel visual imitation network with a spatial attention module for robotic assisted feeding (RAF). The goal is to acquire (i.e., scoop) food items from a bowl. However, achieving robust and adaptive food manipulation is particularly challenging. To deal with this, we propose a framework that integrates visual perception with imitation learning to enable the robot to handle diverse scenarios during scooping. Our approach, named AVIL (adaptive visual imitation learning), exhibits adaptability and robustness across different bowl configurations in terms of material, size, and position, as well as diverse food types including granular, semi-solid, and liquid, even in the presence of distractors. We validate the effectiveness of our approach by conducting experiments on a real robot. We also compare its performance with a baseline. The results demonstrate improvement over the baseline across all scenarios, with an enhancement of up to 2.5 times in terms of a success metric. Notably, our model, trained solely on data from a transparent glass bowl containing granular cereals, showcases generalization ability when tested zero-shot on other bowl configurations with different types of food.
- [1769] arXiv:2403.12900 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model InferenceSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The rapid advancement of Generative Artificial Intelligence (GenAI) across diverse sectors raises significant environmental concerns, notably the carbon emissions from their cloud and high performance computing (HPC) infrastructure. This paper presents Sprout, an innovative framework designed to address these concerns by reducing the carbon footprint of generative Large Language Model (LLM) inference services. Sprout leverages the innovative concept of "generation directives" to guide the autoregressive generation process, thereby enhancing carbon efficiency. Our proposed method meticulously balances the need for ecological sustainability with the demand for high-quality generation outcomes. Employing a directive optimizer for the strategic assignment of generation directives to user prompts and an original offline quality evaluator, Sprout demonstrates a significant reduction in carbon emissions by over 40% in real-world evaluations using the Llama2 LLM and global electricity grid data. This research marks a critical step toward aligning AI technology with sustainable practices, highlighting the potential for mitigating environmental impacts in the rapidly expanding domain of generative artificial intelligence.
- [1770] arXiv:2403.12910 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Yell At Your Robot: Improving On-the-Fly from Language CorrectionsLucy Xiaoyang Shi , Zheyuan Hu , Tony Z. Zhao , Archit Sharma , Karl Pertsch , Jianlan Luo , Sergey Levine , Chelsea FinnComments: Project website: this https URLSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Hierarchical policies that combine language and low-level control have been shown to perform impressively long-horizon robotic tasks, by leveraging either zero-shot high-level planners like pretrained language and vision-language models (LLMs/VLMs) or models trained on annotated robotic demonstrations. However, for complex and dexterous skills, attaining high success rates on long-horizon tasks still represents a major challenge -- the longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this paper, we make the following observation: high-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements ("move a bit to the left"), can be effectively incorporated into high-level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions. This framework enables robots not only to rapidly adapt to real-time language feedback, but also incorporate this feedback into an iterative training scheme that improves the high-level policy's ability to correct errors in both low-level execution and high-level decision-making purely from verbal feedback. Our evaluation on real hardware shows that this leads to significant performance improvement in long-horizon, dexterous manipulation tasks without the need for any additional teleoperation. Videos and code are available at this https URL .
- [1771] arXiv:2403.12918 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource TextsComments: Accepted as a long paper to NAACL 2024 Main Conference; 18 pages, 11 tables, 3 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen subnetwork on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on a suboptimal criteria for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of task-specific weight and pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets.
- [1772] arXiv:2403.12936 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Automatic Information Extraction From Employment Tribunal Judgements Using Large Language ModelsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Court transcripts and judgments are rich repositories of legal knowledge, detailing the intricacies of cases and the rationale behind judicial decisions. The extraction of key information from these documents provides a concise overview of a case, crucial for both legal experts and the public. With the advent of large language models (LLMs), automatic information extraction has become increasingly feasible and efficient. This paper presents a comprehensive study on the application of GPT-4, a large language model, for automatic information extraction from UK Employment Tribunal (UKET) cases. We meticulously evaluated GPT-4's performance in extracting critical information with a manual verification process to ensure the accuracy and relevance of the extracted data. Our research is structured around two primary extraction tasks: the first involves a general extraction of eight key aspects that hold significance for both legal specialists and the general public, including the facts of the case, the claims made, references to legal statutes, references to precedents, general case outcomes and corresponding labels, detailed order and remedies and reasons for the decision. The second task is more focused, aimed at analysing three of those extracted features, namely facts, claims and outcomes, in order to facilitate the development of a tool capable of predicting the outcome of employment law disputes. Through our analysis, we demonstrate that LLMs like GPT-4 can obtain high accuracy in legal information extraction, highlighting the potential of LLMs in revolutionising the way legal information is processed and utilised, offering significant implications for legal research and practice.
- [1773] arXiv:2403.12943 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention TransformersVidhi Jain , Maria Attarian , Nikhil J Joshi , Ayzaan Wahid , Danny Driess , Quan Vuong , Pannag R Sanketi , Pierre Sermanet , Stefan Welker , Christine Chan , Igor Gilitschenski , Yonatan Bisk , Debidatta DwibediComments: Robot learning: Imitation Learning, Robot Perception, Sensing & Vision, Grasping & ManipulationSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human video and robot trajectory. The model leverages cross-attention mechanisms to fuse prompt video features to the robot's current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations. We evaluate Vid2Robot on real-world robots, demonstrating a 20% improvement in performance compared to other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another, and long-horizon composition, thus showcasing its potential for real-world applications. Project website: this http URL
- [1774] arXiv:2403.12952 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes due to domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, a pioneering approach designed to adapt VLMs to test datasets using unlabeled test inputs. Our method is based on the notion of modulating per-class prototypes in the shared embedding space. By pre-computing and caching prototypes generated with the pre-trained text encoder, TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering. At test-time, TPS dynamically learns shift vectors for each prototype based solely on the given test sample, effectively bridging the domain gap and enhancing classification accuracy. A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods. Extensive evaluations across 15 datasets involving natural distribution shifts and cross-dataset generalization demonstrate TPS's superior performance, achieving state-of-the-art results while reducing resource requirements.
- [1775] arXiv:2403.12959 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: WHAC: World-grounded Humans and CamerasWanqi Yin , Zhongang Cai , Ruisi Wang , Fanzhou Wang , Chen Wei , Haiyi Mei , Weiye Xiao , Zhitao Yang , Qingping Sun , Atsushi Yamashita , Ziwei Liu , Lei YangComments: Homepage: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available.
- [1776] arXiv:2403.12961 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: TexTile: A Differentiable Metric for Texture TileabilityComments: CVPR 2024. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract: We introduce TexTile, a novel differentiable metric to quantify the degree upon which a texture image can be concatenated with itself without introducing repeating artifacts (i.e., the tileability). Existing methods for tileable texture synthesis focus on general texture quality, but lack explicit analysis of the intrinsic repeatability properties of a texture. In contrast, our TexTile metric effectively evaluates the tileable properties of a texture, opening the door to more informed synthesis and analysis of tileable textures. Under the hood, TexTile is formulated as a binary classifier carefully built from a large dataset of textures of different styles, semantics, regularities, and human annotations.Key to our method is a set of architectural modifications to baseline pre-train image classifiers to overcome their shortcomings at measuring tileability, along with a custom data augmentation and training regime aimed at increasing robustness and accuracy. We demonstrate that TexTile can be plugged into different state-of-the-art texture synthesis methods, including diffusion-based strategies, and generate tileable textures while keeping or even improving the overall texture quality. Furthermore, we show that TexTile can objectively evaluate any tileable texture synthesis method, whereas the current mix of existing metrics produces uncorrelated scores which heavily hinders progress in the field.
- [1777] arXiv:2403.12981 (cross-list from cs.DC) [ pdf , ps , html , other ]
-
Title: Beyond Inference: Performance Analysis of DNN Server Overheads for Computer VisionComments: 6 pages, 11 figures, DAC 2024: 61st IEEE/ACM Design Automation Conference. (DAC'24)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Deep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end DNN-based vision application contains more than just DNN inference, including input decompression, resizing, sampling, normalization, and data transfer. In this paper, we perform a thorough evaluation of computer vision inference requests performed on a throughput-optimized serving system. We quantify the performance impact of server overheads such as data movement, preprocessing, and message brokers between two DNNs producing outputs at different rates. Our empirical analysis encompasses many computer vision tasks including image classification, segmentation, detection, depth-estimation, and more complex processing pipelines with multiple DNNs. Our results consistently demonstrate that end-to-end application performance can easily be dominated by data processing and data movement functions (up to 56% of end-to-end latency in a medium-sized image, and $\sim$ 80% impact on system throughput in a large image), even though these functions have been conventionally overlooked in deep learning system design. Our work identifies important performance bottlenecks in different application scenarios, achieves 2.25$\times$ better throughput compared to prior work, and paves the way for more holistic deep learning system design.
- [1778] arXiv:2403.12988 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Improving the Robustness of Object Detection and Classification AI models against Adversarial Patch AttacksSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Adversarial patch attacks, crafted to compromise the integrity of Deep Neural Networks (DNNs), significantly impact Artificial Intelligence (AI) systems designed for object detection and classification tasks. The primary purpose of this work is to defend models against real-world physical attacks that target object detection and classification. We analyze attack techniques and propose a robust defense approach. We successfully reduce model confidence by over 20% using adversarial patch attacks that exploit object shape, texture and position. Leveraging the inpainting pre-processing technique, we effectively restore the original confidence levels, demonstrating the importance of robust defenses in mitigating these threats. Following fine-tuning of an AI model for traffic sign classification, we subjected it to a simulated pixelized patch-based physical adversarial attack, resulting in misclassifications. Our inpainting defense approach significantly enhances model resilience, achieving high accuracy and reliable localization despite the adversarial attacks. This contribution advances the resilience and reliability of object detection and classification networks against adversarial challenges, providing a robust foundation for critical applications.
- [1779] arXiv:2403.12997 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: A Multi-Task Oriented Semantic Communication Framework for Autonomous VehiclesSubjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract: Task-oriented semantic communication is an emerging technology that transmits only the relevant semantics of a message instead of the whole message to achieve a specific task. It reduces latency, compresses the data, and is more robust in low SNR scenarios. This work presents a multi-task-oriented semantic communication framework for connected and autonomous vehicles (CAVs). We propose a convolutional autoencoder (CAE) that performs the semantic encoding of the road traffic signs. These encoded images are then transmitted from one CAV to another CAV through satellite in challenging weather conditions where visibility is impaired. In addition, we propose task-oriented semantic decoders for image reconstruction and classification tasks. Simulation results show that the proposed framework outperforms the conventional schemes, such as QAM-16, regarding the reconstructed image's similarity and the classification's accuracy. In addition, it can save up to 89 % of the bandwidth by sending fewer bits.
- [1780] arXiv:2403.12999 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics ControlComments: 17 pages, 4 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Few-shot prompting and step-by-step reasoning have enhanced the capabilities of Large Language Models (LLMs) in tackling complex tasks including code generation. In this paper, we introduce a prompt selection and augmentation algorithm aimed at improving mathematical reasoning and robot arm operations. Our approach incorporates a multi-stage example augmentation scheme combined with an example selection scheme. This algorithm improves LLM performance by selecting a set of examples that increase diversity, minimize redundancy, and increase relevance to the question. When combined with the Program-of-Thought prompting, our algorithm demonstrates an improvement in performance on the GSM8K and SVAMP benchmarks, with increases of 0.3% and 1.1% respectively. Furthermore, in simulated tabletop environments, our algorithm surpasses the Code-as-Policies approach by achieving a 3.4% increase in successful task completions and a decrease of over 70% in the number of examples used. Its ability to discard examples that contribute little to solving the problem reduces the inferencing time of an LLM-powered robotics system. This algorithm also offers important benefits for industrial process automation by streamlining the development and deployment process, reducing manual programming effort, and enhancing code reusability.
- [1781] arXiv:2403.13000 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Duwak: Dual Watermarks in Large Language ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Abstract: As large language models (LLM) are increasingly used for text generation tasks, it is critical to audit their usages, govern their applications, and mitigate their potential harms. Existing watermark techniques are shown effective in embedding single human-imperceptible and machine-detectable patterns without significantly affecting generated text quality and semantics. However, the efficiency in detecting watermarks, i.e., the minimum number of tokens required to assert detection with significance and robustness against post-editing, is still debatable. In this paper, we propose, Duwak, to fundamentally enhance the efficiency and quality of watermarking by embedding dual secret patterns in both token probability distribution and sampling schemes. To mitigate expression degradation caused by biasing toward certain tokens, we design a contrastive search to watermark the sampling scheme, which minimizes the token repetition and enhances the diversity. We theoretically explain the interdependency of the two watermarks within Duwak. We evaluate Duwak extensively on Llama2 under various post-editing attacks, against four state-of-the-art watermarking techniques and combinations of them. Our results show that Duwak marked text achieves the highest watermarked text quality at the lowest required token count for detection, up to 70% tokens less than existing approaches, especially under post paraphrasing.
- [1782] arXiv:2403.13001 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Fundamental Components of Deep Learning: A category-theoretic approachComments: PhD Thesis defended at University of StrathclydeSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Category Theory (math.CT)
Abstract: Deep learning, despite its remarkable achievements, is still a young field. Like the early stages of many scientific disciplines, it is marked by the discovery of new phenomena, ad-hoc design decisions, and the lack of a uniform and compositional mathematical foundation. From the intricacies of the implementation of backpropagation, through a growing zoo of neural network architectures, to the new and poorly understood phenomena such as double descent, scaling laws or in-context learning, there are few unifying principles in deep learning. This thesis develops a novel mathematical foundation for deep learning based on the language of category theory. We develop a new framework that is a) end-to-end, b) unform, and c) not merely descriptive, but prescriptive, meaning it is amenable to direct implementation in programming languages with sufficient features. We also systematise many existing approaches, placing many existing constructions and concepts from the literature under the same umbrella. In Part I we identify and model two main properties of deep learning systems parametricity and bidirectionality by we expand on the previously defined construction of actegories and Para to study the former, and define weighted optics to study the latter. Combining them yields parametric weighted optics, a categorical model of artificial neural networks, and more. Part II justifies the abstractions from Part I, applying them to model backpropagation, architectures, and supervised learning. We provide a lens-theoretic axiomatisation of differentiation, covering not just smooth spaces, but discrete settings of boolean circuits as well. We survey existing, and develop new categorical models of neural network architectures. We formalise the notion of optimisers and lastly, combine all the existing concepts together, providing a uniform and compositional framework for supervised learning.
- [1783] arXiv:2403.13002 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: AutoTRIZ: Artificial Ideation with TRIZ and Large Language ModelsComments: 13pages, 6 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Researchers and innovators have made enormous efforts in developing ideation methods, such as morphological analysis and design-by-analogy, to aid engineering design ideation for problem solving and innovation. Among these, TRIZ stands out as the most well-known approach, widely applied for systematic innovation. However, the complexity of TRIZ resources and concepts, coupled with its reliance on users' knowledge, experience, and reasoning capabilities, limits its practicability. This paper proposes AutoTRIZ, an artificial ideation tool that leverages large language models (LLMs) to automate and enhance the TRIZ methodology. By leveraging the broad knowledge and advanced reasoning capabilities of LLMs, AutoTRIZ offers a novel approach to design automation and interpretable ideation with artificial intelligence. We demonstrate and evaluate the effectiveness of AutoTRIZ through consistency experiments in contradiction detection and comparative studies with cases collected from TRIZ textbooks. Moreover, the proposed LLM-based framework holds the potential for extension to automate other knowledge-based ideation methods, including SCAMPER, Design Heuristics, and Design-by-Analogy, paving the way for a new era of artificial ideation for design and innovation.
- [1784] arXiv:2403.13004 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: (Beyond) Reasonable Doubt: Challenges that Public Defenders Face in Scrutinizing AI in CourtComments: 29 pages, 4 figures. To appear in Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Accountable use of AI systems in high-stakes settings relies on making systems contestable. In this paper we study efforts to contest AI systems in practice by studying how public defenders scrutinize AI in court. We present findings from interviews with 17 people in the U.S. public defense community to understand their perceptions of and experiences scrutinizing computational forensic software (CFS) -- automated decision systems that the government uses to convict and incarcerate, such as facial recognition, gunshot detection, and probabilistic genotyping tools. We find that our participants faced challenges assessing and contesting CFS reliability due to difficulties (a) navigating how CFS is developed and used, (b) overcoming judges and jurors' non-critical perceptions of CFS, and (c) gathering CFS expertise. To conclude, we provide recommendations that center the technical, social, and institutional context to better position interventions such as performance evaluations to support contestability in practice.
- [1785] arXiv:2403.13017 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Impart: An Imperceptible and Effective Label-Specific Backdoor AttackSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Backdoor attacks have been shown to impose severe threats to real security-critical scenarios. Although previous works can achieve high attack success rates, they either require access to victim models which may significantly reduce their threats in practice, or perform visually noticeable in stealthiness. Besides, there is still room to improve the attack success rates in the scenario that different poisoned samples may have different target labels (a.k.a., the all-to-all setting). In this study, we propose a novel imperceptible backdoor attack framework, named Impart, in the scenario where the attacker has no access to the victim model. Specifically, in order to enhance the attack capability of the all-to-all setting, we first propose a label-specific attack. Different from previous works which try to find an imperceptible pattern and add it to the source image as the poisoned image, we then propose to generate perturbations that align with the target label in the image feature by a surrogate model. In this way, the generated poisoned images are attached with knowledge about the target class, which significantly enhances the attack capability.
- [1786] arXiv:2403.13018 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Invisible Backdoor Attack Through Singular Value DecompositionSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: With the widespread application of deep learning across various domains, concerns about its security have grown significantly. Among these, backdoor attacks pose a serious security threat to deep neural networks (DNNs). In recent years, backdoor attacks on neural networks have become increasingly sophisticated, aiming to compromise the security and trustworthiness of models by implanting hidden, unauthorized functionalities or triggers, leading to misleading predictions or behaviors. To make triggers less perceptible and imperceptible, various invisible backdoor attacks have been proposed. However, most of them only consider invisibility in the spatial domain, making it easy for recent defense methods to detect the generated toxic this http URL address these challenges, this paper proposes an invisible backdoor attack called DEBA. DEBA leverages the mathematical properties of Singular Value Decomposition (SVD) to embed imperceptible backdoors into models during the training phase, thereby causing them to exhibit predefined malicious behavior under specific trigger conditions. Specifically, we first perform SVD on images, and then replace the minor features of trigger images with those of clean images, using them as triggers to ensure the effectiveness of the attack. As minor features are scattered throughout the entire image, the major features of clean images are preserved, making poisoned images visually indistinguishable from clean ones. Extensive experimental evaluations demonstrate that DEBA is highly effective, maintaining high perceptual quality and a high attack success rate for poisoned images. Furthermore, we assess the performance of DEBA under existing defense measures, showing that it is robust and capable of significantly evading and resisting the effects of these defense measures.
- [1787] arXiv:2403.13031 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: RigorLLM: Resilient Guardrails for Large Language Models against Undesired ContentSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Recent advancements in Large Language Models (LLMs) have showcased remarkable capabilities across various tasks in different domains. However, the emergence of biases and the potential for generating harmful content in LLMs, particularly under malicious inputs, pose significant challenges. Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently and effectively moderate harmful and unsafe inputs and outputs for LLMs. By employing a multi-faceted approach that includes energy-based training data augmentation through Langevin dynamics, optimizing a safe suffix for inputs via minimax optimization, and integrating a fusion-based model combining robust KNN with LLMs based on our data augmentation, RigorLLM offers a robust solution to harmful content moderation. Our experimental evaluations demonstrate that RigorLLM not only outperforms existing baselines like OpenAI API and Perspective API in detecting harmful content but also exhibits unparalleled resilience to jailbreaking attacks. The innovative use of constrained optimization and a fusion-based guardrail approach represents a significant step forward in developing more secure and reliable LLMs, setting a new standard for content moderation frameworks in the face of evolving digital threats.
- [1788] arXiv:2403.13040 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Physics-Guided Neural Networks for Intraventricular Vector Flow MappingHang Jung Ling , Salomé Bru , Julia Puig , Florian Vixège , Simon Mendez , Franck Nicoud , Pierre-Yves Courand , Olivier Bernard , Damien GarciaComments: 11 pages, submitted to IEEE TUFFCSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Intraventricular vector flow mapping (iVFM) seeks to enhance and quantify color Doppler in cardiac imaging. In this study, we propose novel alternatives to the traditional iVFM optimization scheme by utilizing physics-informed neural networks (PINNs) and a physics-guided nnU-Net-based supervised approach. Through rigorous evaluation on simulated color Doppler images derived from a patient-specific computational fluid dynamics model and in vivo Doppler acquisitions, both approaches demonstrate comparable reconstruction performance to the original iVFM algorithm. The efficiency of PINNs is boosted through dual-stage optimization and pre-optimized weights. On the other hand, the nnU-Net method excels in generalizability and real time capabilities. Notably, nnU-Net shows superior robustness on sparse and truncated Doppler data while maintaining independence from explicit boundary conditions. Overall, our results highlight the effectiveness of these methods in reconstructing intraventricular vector blood flow. The study also suggests potential applications of PINNs in ultrafast color Doppler imaging and the incorporation of fluid dynamics equations to derive biomarkers for cardiovascular diseases based on blood flow.
- [1789] arXiv:2403.13041 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Provable Privacy with Non-Private Pre-ProcessingSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
- [1790] arXiv:2403.13078 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: HuLP: Human-in-the-Loop for PrognosisSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: This paper introduces HuLP, a Human-in-the-Loop for Prognosis model designed to enhance the reliability and interpretability of prognostic models in clinical contexts, especially when faced with the complexities of missing covariates and outcomes. HuLP offers an innovative approach that enables human expert intervention, empowering clinicians to interact with and correct models' predictions, thus fostering collaboration between humans and AI models to produce more accurate prognosis. Additionally, HuLP addresses the challenges of missing data by utilizing neural networks and providing a tailored methodology that effectively handles missing data. Traditional methods often struggle to capture the nuanced variations within patient populations, leading to compromised prognostic predictions. HuLP imputes missing covariates based on imaging features, aligning more closely with clinician workflows and enhancing reliability. We conduct our experiments on two real-world, publicly available medical datasets to demonstrate the superiority of HuLP.
- [1791] arXiv:2403.13091 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: JaxUED: A simple and useable UED library in JaxComments: 11 pages, 5 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We present JaxUED, an open-source library providing minimal dependency implementations of modern Unsupervised Environment Design (UED) algorithms in Jax. JaxUED leverages hardware acceleration to obtain on the order of 100x speedups compared to prior, CPU-based implementations. Inspired by CleanRL, we provide fast, clear, understandable, and easily modifiable implementations, with the aim of accelerating research into UED. This paper describes our library and contains baseline results. Code can be found at this https URL .
- [1792] arXiv:2403.13097 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Simple Ingredients for Offline Reinforcement LearningEdoardo Cetin , Andrea Tirinzoni , Matteo Pirotta , Alessandro Lazaric , Yann Ollivier , Ahmed TouatiSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task. Yet, leveraging a novel testbed (MOOD) in which trajectories come from heterogeneous sources, we show that existing methods struggle with diverse data: their performance considerably deteriorates as data collected for related but different tasks is simply added to the offline buffer. In light of this finding, we conduct a large empirical study where we formulate and test several hypotheses to explain this failure. Surprisingly, we find that scale, more than algorithmic considerations, is the key factor influencing performance. We show that simple methods like AWAC and IQL with increased network size overcome the paradoxical failure modes from the inclusion of additional data in MOOD, and notably outperform prior state-of-the-art algorithms on the canonical D4RL benchmark.
- [1793] arXiv:2403.13101 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: AdaptSFL: Adaptive Split Federated Learning in Resource-constrained Edge NetworksComments: 15 pages, 10 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: The increasing complexity of deep neural networks poses significant barriers to democratizing them to resource-limited edge devices. To address this challenge, split federated learning (SFL) has emerged as a promising solution by of floading the primary training workload to a server via model partitioning while enabling parallel training among edge devices. However, although system optimization substantially influences the performance of SFL under resource-constrained systems, the problem remains largely uncharted. In this paper, we provide a convergence analysis of SFL which quantifies the impact of model splitting (MS) and client-side model aggregation (MA) on the learning performance, serving as a theoretical foundation. Then, we propose AdaptSFL, a novel resource-adaptive SFL framework, to expedite SFL under resource-constrained edge computing systems. Specifically, AdaptSFL adaptively controls client-side MA and MS to balance communication-computing latency and training convergence. Extensive simulations across various datasets validate that our proposed AdaptSFL framework takes considerably less time to achieve a target accuracy than benchmarks, demonstrating the effectiveness of the proposed strategies.
- [1794] arXiv:2403.13106 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Knowing Your Nonlinearities: Shapley Interactions Reveal the Underlying Structure of DataSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Measuring nonlinear feature interaction is an established approach to understanding complex patterns of attribution in many models. In this paper, we use Shapley Taylor interaction indices (STII) to analyze the impact of underlying data structure on model representations in a variety of modalities, tasks, and architectures. Considering linguistic structure in masked and auto-regressive language models (MLMs and ALMs), we find that STII increases within idiomatic expressions and that MLMs scale STII with syntactic distance, relying more on syntax in their nonlinear structure than ALMs do. Our speech model findings reflect the phonetic principal that the openness of the oral cavity determines how much a phoneme varies based on its context. Finally, we study image classifiers and illustrate that feature interactions intuitively reflect object boundaries. Our wide range of results illustrates the benefits of interdisciplinary work and domain expertise in interpretability research.
- [1795] arXiv:2403.13111 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep learning with noisy labels in medical prediction problems: a scoping reviewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Objectives: Medical research faces substantial challenges from noisy labels attributed to factors like inter-expert variability and machine-extracted labels. Despite this, the adoption of label noise management remains limited, and label noise is largely ignored. To this end, there is a critical need to conduct a scoping review focusing on the problem space. This scoping review aims to comprehensively review label noise management in deep learning-based medical prediction problems, which includes label noise detection, label noise handling, and evaluation. Research involving label uncertainty is also included.
Methods: Our scoping review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We searched 4 databases, including PubMed, IEEE Xplore, Google Scholar, and Semantic Scholar. Our search terms include "noisy label AND medical / healthcare / clinical", "un-certainty AND medical / healthcare / clinical", and "noise AND medical / healthcare / clinical".
Results: A total of 60 papers met inclusion criteria between 2016 and 2023. A series of practical questions in medical research are investigated. These include the sources of label noise, the impact of label noise, the detection of label noise, label noise handling techniques, and their evaluation. Categorization of both label noise detection methods and handling techniques are provided.
Discussion: From a methodological perspective, we observe that the medical community has been up to date with the broader deep-learning community, given that most techniques have been evaluated on medical data. We recommend considering label noise as a standard element in medical research, even if it is not dedicated to handling noisy labels. Initial experiments can start with easy-to-implement methods, such as noise-robust loss functions, weighting, and curriculum learning. - [1796] arXiv:2403.13125 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Probabilistic Circuits with Constraints via Convex OptimizationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This work addresses integrating probabilistic propositional logic constraints into the distribution encoded by a probabilistic circuit (PC). PCs are a class of tractable models that allow efficient computations (such as conditional and marginal probabilities) while achieving state-of-the-art performance in some domains. The proposed approach takes both a PC and constraints as inputs, and outputs a new PC that satisfies the constraints. This is done efficiently via convex optimization without the need to retrain the entire model. Empirical evaluations indicate that the combination of constraints and PCs can have multiple use cases, including the improvement of model performance under scarce or incomplete data, as well as the enforcement of machine learning fairness measures into the model without compromising model fitness. We believe that these ideas will open possibilities for multiple other applications involving the combination of logics and deep probabilistic models.
- [1797] arXiv:2403.13130 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Self-generated Replay Memories for Continual Neural Machine TranslationComments: Accepted at NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Modern Neural Machine Translation systems exhibit strong performance in several different languages and are constantly improving. Their ability to learn continuously is, however, still severely limited by the catastrophic forgetting issue. In this work, we leverage a key property of encoder-decoder Transformers, i.e. their generative ability, to propose a novel approach to continually learning Neural Machine Translation systems. We show how this can effectively learn on a stream of experiences comprising different languages, by leveraging a replay memory populated by using the model itself as a generator of parallel sentences. We empirically demonstrate that our approach can counteract catastrophic forgetting without requiring explicit memorization of training data. Code will be publicly available upon publication. Code: this https URL
- [1798] arXiv:2403.13134 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Robust NAS under adversarial training: benchmark, theory, and beyondSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Recent developments in neural architecture search (NAS) emphasize the significance of considering robust architectures against malicious data. However, there is a notable absence of benchmark evaluations and theoretical guarantees for searching these robust architectures, especially when adversarial training is considered. In this work, we aim to address these two challenges, making twofold contributions. First, we release a comprehensive data set that encompasses both clean accuracy and robust accuracy for a vast array of adversarially trained networks from the NAS-Bench-201 search space on image datasets. Then, leveraging the neural tangent kernel (NTK) tool from deep learning theory, we establish a generalization theory for searching architecture in terms of clean accuracy and robust accuracy under multi-objective adversarial training. We firmly believe that our benchmark and theoretical insights will significantly benefit the NAS community through reliable reproducibility, efficient assessment, and theoretical foundation, particularly in the pursuit of robust architectures.
- [1799] arXiv:2403.13150 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Training Survival Models using Scoring RulesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
Abstract: Survival Analysis provides critical insights for partially incomplete time-to-event data in various domains. It is also an important example of probabilistic machine learning. The probabilistic nature of the predictions can be exploited by using (proper) scoring rules in the model fitting process instead of likelihood-based optimization. Our proposal does so in a generic manner and can be used for a variety of model classes. We establish different parametric and non-parametric sub-frameworks that allow different degrees of flexibility. Incorporated into neural networks, it leads to a computationally efficient and scalable optimization routine, yielding state-of-the-art predictive performance. Finally, we show that using our framework, we can recover various parametric models and demonstrate that optimization works equally well when compared to likelihood-based methods.
- [1800] arXiv:2403.13178 (cross-list from stat.ML) [ pdf , ps , html , other ]
-
Title: Fast Value Tracking for Deep Reinforcement LearningSubjects: Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interacts with their environment. However, existing algorithms often view these problem as static, focusing on point estimates for model parameters to maximize expected rewards, neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
- [1801] arXiv:2403.13193 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: A Study of Vulnerability Repair in JavaScript Programs with Large Language ModelsComments: camera-ready version accepted to the short paper track at WWW'24Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, JavaScript has become the most widely used programming language, especially in web development. However, writing secure JavaScript code is not trivial, and programmers often make mistakes that lead to security vulnerabilities in web applications. Large Language Models (LLMs) have demonstrated substantial advancements across multiple domains, and their evolving capabilities indicate their potential for automatic code generation based on a required specification, including automatic bug fixing. In this study, we explore the accuracy of LLMs, namely ChatGPT and Bard, in finding and fixing security vulnerabilities in JavaScript programs. We also investigate the impact of context in a prompt on directing LLMs to produce a correct patch of vulnerable JavaScript code. Our experiments on real-world software vulnerabilities show that while LLMs are promising in automatic program repair of JavaScript code, achieving a correct bug fix often requires an appropriate amount of context in the prompt.
- [1802] arXiv:2403.13196 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: ADAPT to Robustify Prompt Tuning Vision TransformersSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Abstract: The performance of deep models, including Vision Transformers, is known to be vulnerable to adversarial attacks. Many existing defenses against these attacks, such as adversarial training, rely on full-model fine-tuning to induce robustness in the models. These defenses require storing a copy of the entire model, that can have billions of parameters, for each task. At the same time, parameter-efficient prompt tuning is used to adapt large transformer-based models to downstream tasks without the need to save large copies. In this paper, we examine parameter-efficient prompt tuning of Vision Transformers for downstream tasks under the lens of robustness. We show that previous adversarial defense methods, when applied to the prompt tuning paradigm, suffer from gradient obfuscation and are vulnerable to adaptive attacks. We introduce ADAPT, a novel framework for performing adaptive adversarial training in the prompt tuning paradigm. Our method achieves competitive robust accuracy of ~40% w.r.t. SOTA robustness methods using full-model fine-tuning, by tuning only ~1% of the number of parameters.
- [1803] arXiv:2403.13206 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Depth-guided NeRF Training via Earth Mover's DistanceComments: Preprint. Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice, their accuracy is more challenging to capture. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because we know that depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover's Distance instead of enforcing the rendered depth to replicate the depth prior exactly through L2-loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.
- [1804] arXiv:2403.13214 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Nellie: Automated organelle segmentation, tracking, and hierarchical feature extraction in 2D/3D live-cell microscopyAustin E. Y. T. Lefebvre (1), Gabriel Sturm (1 and 2), Ting-Yu Lin (1), Emily Stoops (1), Magdalena Preciado Lopez (1), Benjamin Kaufmann-Malaga (1), Kayley Hake (1) ((1) Calico Life Sciences LLC, (2) Department of Biochemistry and Biophysics, University of California San Francisco)Comments: for associated code, see this https URL 82 pages, 5 main figures, 11 extended figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Abstract: The analysis of dynamic organelles remains a formidable challenge, though key to understanding biological processes. We introduce Nellie, an automated and unbiased pipeline for segmentation, tracking, and feature extraction of diverse intracellular structures. Nellie adapts to image metadata, eliminating user input. Nellie's preprocessing pipeline enhances structural contrast on multiple intracellular scales allowing for robust hierarchical segmentation of sub-organellar regions. Internal motion capture markers are generated and tracked via a radius-adaptive pattern matching scheme, and used as guides for sub-voxel flow interpolation. Nellie extracts a plethora of features at multiple hierarchical levels for deep and customizable analysis. Nellie features a Napari-based GUI that allows for code-free operation and visualization, while its modular open-source codebase invites customization by experienced users. We demonstrate Nellie's wide variety of use cases with two examples: unmixing multiple organelles from a single channel using feature-based classification and training an unsupervised graph autoencoder on mitochondrial multi-mesh graphs to quantify latent space embedding changes following ionomycin treatment.
- [1805] arXiv:2403.13218 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Self-Attention Based Semantic Decomposition in Vector Symbolic ArchitecturesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Abstract: Vector Symbolic Architectures (VSAs) have emerged as a novel framework for enabling interpretable machine learning algorithms equipped with the ability to reason and explain their decision processes. The basic idea is to represent discrete information through high dimensional random vectors. Complex data structures can be built up with operations over vectors such as the "binding" operation involving element-wise vector multiplication, which associates data together. The reverse task of decomposing the associated elements is a combinatorially hard task, with an exponentially large search space. The main algorithm for performing this search is the resonator network, inspired by Hopfield network-based memory search operations.
In this work, we introduce a new variant of the resonator network, based on self-attention based update rules in the iterative search problem. This update rule, based on the Hopfield network with log-sum-exp energy function and norm-bounded states, is shown to substantially improve the performance and rate of convergence. As a result, our algorithm enables a larger capacity for associative memory, enabling applications in many tasks like perception based pattern recognition, scene decomposition, and object reasoning. We substantiate our algorithm with a thorough evaluation and comparisons to baselines. - [1806] arXiv:2403.13236 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Safety-Aware Reinforcement Learning for Electric Vehicle Charging Station Management in Distribution NetworkComments: 2024 IEEE Power & Energy Society General Meeting (PESGM)Subjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Abstract: The increasing integration of electric vehicles (EVs) into the grid can pose a significant risk to the distribution system operation in the absence of coordination. In response to the need for effective coordination of EVs within the distribution network, this paper presents a safety-aware reinforcement learning (RL) algorithm designed to manage EV charging stations while ensuring the satisfaction of system constraints. Unlike existing methods, our proposed algorithm does not rely on explicit penalties for constraint violations, eliminating the need for penalty coefficient tuning. Furthermore, managing EV charging stations is further complicated by multiple uncertainties, notably the variability in solar energy generation and energy prices. To address this challenge, we develop an off-policy RL algorithm to efficiently utilize data to learn patterns in such uncertain environments. Our algorithm also incorporates a maximum entropy framework to enhance the RL algorithm's exploratory process, preventing convergence to local optimal solutions. Simulation results demonstrate that our algorithm outperforms traditional RL algorithms in managing EV charging in the distribution network.
- [1807] arXiv:2403.13244 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language ModelPeng Zhou , Jianmin Wang , Chunyan Li , Zixu Wang , Yiping Liu , Siqi Sun , Jianxin Lin , Longyue Wang , Xiangxiang ZengComments: 25 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratio of 88.08%, 65.27%, and 61.44%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties that have not been encountered. It can comprehend text inputs with various language styles, extending beyond the confines of outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in the domains of drug discovery and materials science. Code is available at this https URL .
- [1808] arXiv:2403.13245 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: Federated reinforcement learning for robot motion planning with zero-shot generalizationSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: This paper considers the problem of learning a control policy for robot motion planning with zero-shot generalization, i.e., no data collection and policy adaptation is needed when the learned policy is deployed in new environments. We develop a federated reinforcement learning framework that enables collaborative learning of multiple learners and a central server, i.e., the Cloud, without sharing their raw data. In each iteration, each learner uploads its local control policy and the corresponding estimated normalized arrival time to the Cloud, which then computes the global optimum among the learners and broadcasts the optimal policy to the learners. Each learner then selects between its local control policy and that from the Cloud for next iteration. The proposed framework leverages on the derived zero-shot generalization guarantees on arrival time and safety. Theoretical guarantees on almost-sure convergence, almost consensus, Pareto improvement and optimality gap are also provided. Monte Carlo simulation is conducted to evaluate the proposed framework.
- [1809] arXiv:2403.13249 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Unified and General Framework for Continual LearningComments: ICLR 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Continual Learning (CL) focuses on learning from dynamic and changing data distributions while retaining previously acquired knowledge. Various methods have been developed to address the challenge of catastrophic forgetting, including regularization-based, Bayesian-based, and memory-replay-based techniques. However, these methods lack a unified framework and common terminology for describing their approaches. This research aims to bridge this gap by introducing a comprehensive and overarching framework that encompasses and reconciles these existing methodologies. Notably, this new framework is capable of encompassing established CL approaches as special instances within a unified and general optimization objective. An intriguing finding is that despite their diverse origins, these methods share common mathematical structures. This observation highlights the compatibility of these seemingly distinct techniques, revealing their interconnectedness through a shared underlying optimization objective. Moreover, the proposed general framework introduces an innovative concept called refresh learning, specifically designed to enhance the CL performance. This novel approach draws inspiration from neuroscience, where the human brain often sheds outdated information to improve the retention of crucial knowledge and facilitate the acquisition of new information. In essence, refresh learning operates by initially unlearning current data and subsequently relearning it. It serves as a versatile plug-in that seamlessly integrates with existing CL methods, offering an adaptable and effective enhancement to the learning process. Extensive experiments on CL benchmarks and theoretical analysis demonstrate the effectiveness of the proposed refresh learning. Code is available at \url{ this https URL }.
- [1810] arXiv:2403.13257 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Arcee's MergeKit: A Toolkit for Merging Large Language ModelsCharles Goddard , Shamane Siriwardhana , Malikeh Ehghaghi , Luke Meyers , Vlad Karpukhin , Brian Benedict , Mark McQuade , Jacob SolawetzComments: 11 pages, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, has resulted in the development of vast amounts of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI - including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the worlds most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at this https URL .
- [1811] arXiv:2403.13269 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large ModelsComments: 5 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed as Adaptive Freezing of Low Rank Adaptation (AFLoRA). Specifically, for each pre-trained frozen weight tensor, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Based on a novel freezing score, we the incrementally freeze these projection matrices during fine-tuning to reduce the computation and alleviate over-fitting. Our experimental results demonstrate that we can achieve state-of-the-art performance with an average improvement of up to $0.85\%$ as evaluated on GLUE benchmark while yeilding up to $9.5\times$ fewer average trainable parameters. While compared in terms of runtime, AFLoRA can yield up to $1.86\times$ improvement as opposed to similar PEFT alternatives. Besides the practical utility of our approach, we provide insights on the trainability requirements of LoRA paths at different modules and the freezing schedule for the different projection matrices. Code will be released.
- [1812] arXiv:2403.13293 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Building Optimal Neural Architectures using Interpretable KnowledgeKeith G. Mills , Fred X. Han , Mohammad Salameh , Shengyao Lu , Chunhua Zhou , Jiao He , Fengyu Sun , Di NiuComments: CVPR'24; 18 Pages, 18 Figures, 3 TablesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Neural Architecture Search is a costly practice. The fact that a search space can span a vast number of design choices with each architecture evaluation taking nontrivial overhead makes it hard for an algorithm to sufficiently explore candidate networks. In this paper, we propose AutoBuild, a scheme which learns to align the latent embeddings of operations and architecture modules with the ground-truth performance of the architectures they appear in. By doing so, AutoBuild is capable of assigning interpretable importance scores to architecture modules, such as individual operation features and larger macro operation sequences such that high-performance neural networks can be constructed without any need for search. Through experiments performed on state-of-the-art image classification, segmentation, and Stable Diffusion models, we show that by mining a relatively small set of evaluated architectures, AutoBuild can learn to build high-quality architectures directly or help to reduce search space to focus on relevant areas, finding better architectures that outperform both the original labeled ones and ones found by search baselines. Code available at this https URL
- [1813] arXiv:2403.13309 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment ProposalComments: 10 pages, 1 figure, 3 tablesSubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: The rapid integration of Large Language Models (LLMs) across diverse sectors has marked a transformative era, showcasing remarkable capabilities in text generation and problem-solving tasks. However, this technological advancement is accompanied by significant risks and vulnerabilities. Despite ongoing security enhancements, attackers persistently exploit these weaknesses, casting doubts on the overall trustworthiness of LLMs. Compounding the issue, organisations are deploying LLM-integrated systems without understanding the severity of potential consequences. Existing studies by OWASP and MITRE offer a general overview of threats and vulnerabilities but lack a method for directly and succinctly analysing the risks for security practitioners, developers, and key decision-makers who are working with this novel technology. To address this gap, we propose a risk assessment process using tools like the OWASP risk rating methodology which is used for traditional systems. We conduct scenario analysis to identify potential threat agents and map the dependent system components against vulnerability factors. Through this analysis, we assess the likelihood of a cyberattack. Subsequently, we conduct a thorough impact analysis to derive a comprehensive threat matrix. We also map threats against three key stakeholder groups: developers engaged in model fine-tuning, application developers utilizing third-party APIs, and end users. The proposed threat matrix provides a holistic evaluation of LLM-related risks, enabling stakeholders to make informed decisions for effective mitigation strategies. Our outlined process serves as an actionable and comprehensive tool for security practitioners, offering insights for resource management and enhancing the overall system security.
- [1814] arXiv:2403.13334 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Hyacinth6B: A large language model for Traditional ChineseComments: 14pagesSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: This research's primary motivation of this study is to address the high hardware and computational demands typically associated with LLMs.Therefore,our goal is to find a balance between model lightness and performance,striving to maximize performance while using a comparatively lightweight model. Hyacinth6B was developed with this objective in mind,aiming to fully leverage the core capabilities of LLMs without incurring substantial resource costs, effectively pushing the boundaries of smaller model's performance. The training approach involves parameter efficient finetuning using the LoRA method.
- [1815] arXiv:2403.13335 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Adaptive Ensembles of Fine-Tuned Transformers for LLM-Generated Text DetectionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have reached human-like proficiency in generating diverse textual content, underscoring the necessity for effective fake text detection to avoid potential risks such as fake news in social media. Previous research has mostly tested single models on in-distribution datasets, limiting our understanding of how these models perform on different types of data for LLM-generated text detection task. We researched this by testing five specialized transformer-based models on both in-distribution and out-of-distribution datasets to better assess their performance and generalizability. Our results revealed that single transformer-based classifiers achieved decent performance on in-distribution dataset but limited generalization ability on out-of-distribution dataset. To improve it, we combined the individual classifiers models using adaptive ensemble algorithms, which improved the average accuracy significantly from 91.8% to 99.2% on an in-distribution test set and from 62.9% to 72.5% on an out-of-distribution test set. The results indicate the effectiveness, good generalization ability, and great potential of adaptive ensemble algorithms in LLM-generated text detection.
- [1816] arXiv:2403.13337 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning Novel View Synthesis from Heterogeneous Low-light CapturesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Neural radiance field has achieved fundamental success in novel view synthesis from input views with the same brightness level captured under fixed normal lighting. Unfortunately, synthesizing novel views remains to be a challenge for input views with heterogeneous brightness level captured under low-light condition. The condition is pretty common in the real world. It causes low-contrast images where details are concealed in the darkness and camera sensor noise significantly degrades the image quality. To tackle this problem, we propose to learn to decompose illumination, reflectance, and noise from input views according to that reflectance remains invariant across heterogeneous views. To cope with heterogeneous brightness and noise levels across multi-views, we learn an illumination embedding and optimize a noise map individually for each view. To allow intuitive editing of the illumination, we design an illumination adjustment module to enable either brightening or darkening of the illumination component. Comprehensive experiments demonstrate that this approach enables effective intrinsic decomposition for low-light multi-view noisy images and achieves superior visual quality and numerical performance for synthesizing novel views compared to state-of-the-art methods.
- [1817] arXiv:2403.13341 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image AnalysisSantosh Sanjeev , Nuren Zhaksylyk , Ibrahim Almakky , Anees Ur Rehman Hashmi , Mohammad Areeb Qazi , Mohammad YaqubSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: The scarcity of well-annotated medical datasets requires leveraging transfer learning from broader datasets like ImageNet or pre-trained models like CLIP. Model soups averages multiple fine-tuned models aiming to improve performance on In-Domain (ID) tasks and enhance robustness against Out-of-Distribution (OOD) datasets. However, applying these methods to the medical imaging domain faces challenges and results in suboptimal performance. This is primarily due to differences in error surface characteristics that stem from data complexities such as heterogeneity, domain shift, class imbalance, and distributional shifts between training and testing phases. To address this issue, we propose a hierarchical merging approach that involves local and global aggregation of models at various levels based on models' hyperparameter configurations. Furthermore, to alleviate the need for training a large number of models in the hyperparameter search, we introduce a computationally efficient method using a cyclical learning rate scheduler to produce multiple models for aggregation in the weight space. Our method demonstrates significant improvements over the model souping approach across multiple datasets (around 6% gain in HAM10000 and CheXpert datasets) while maintaining low computational costs for model generation and selection. Moreover, we achieve better results on OOD datasets than model soups. The code is available at this https URL .
- [1818] arXiv:2403.13344 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: USE: Dynamic User Modeling with Stateful Sequence ModelsZhihan Zhou , Qixiang Fang , Leonardo Neves , Francesco Barbieri , Yozen Liu , Han Liu , Maarten W. Bos , Ron DotschSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Abstract: User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling. As users continuously interact with the apps, user embeddings should be periodically updated to account for users' recent and long-term behavior patterns. Existing methods highly rely on stateless sequence models that lack memory of historical behavior. They have to either discard historical data and use only the most recent data or reprocess the old and new data jointly. Both cases incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings and reflects users' evolving behaviors without the need for exhaustive reprocessing by storing previous model states and revisiting them in the future. Furthermore, we introduce a novel training objective named future W-behavior prediction to transcend the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. By combining it with the Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, we further improve the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings. We demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
- [1819] arXiv:2403.13355 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: BadEdit: Backdooring large language models by model editingYanzhou Li , Tianlin Li , Kangjie Chen , Jian Zhang , Shangqing Liu , Wenhan Wang , Tianwei Zhang , Yang LiuComments: ICLR 2024Subjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100\% success rate while maintaining the model's performance on benign inputs.
- [1820] arXiv:2403.13362 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: Incentivizing News Consumption on Social Media Platforms Using Large Language Models and Realistic Bot AccountsHadi Askari , Anshuman Chhabra , Bernhard Clemm von Hohenberg , Michael Heseltine , Magdalena WojcieszakSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Polarization, declining trust, and wavering support for democratic norms are pressing threats to U.S. democracy. Exposure to verified and quality news may lower individual susceptibility to these threats and make citizens more resilient to misinformation, populism, and hyperpartisan rhetoric. This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale two-week long field experiment (from 1/19/2023 to 2/3/2023) on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of quality news organization and an encouragement to follow its Twitter account. To further test differential effects by gender of the bots, treated users were randomly assigned to receive responses by bots presented as female or male. We examine whether our over-time intervention enhances the following of news media organization, the sharing and the liking of news content and the tweeting about politics and the liking of political content. We find that the treated users followed more news accounts and the users in the female bot treatment were more likely to like news content than the control. Most of these results, however, were small in magnitude and confined to the already politically interested Twitter users, as indicated by their pre-treatment tweeting about politics. These findings have implications for social media and news organizations, and also offer direction for future work on how Large Language Models and other computational interventions can effectively enhance individual on-platform engagement with quality news and public affairs.
- [1821] arXiv:2403.13368 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Computational Models to Study Language Processing in the Human Brain: A SurveySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite differing from the human language processing mechanism in implementation and algorithms, current language models demonstrate remarkable human-like or surpassing language capabilities. Should computational language models be employed in studying the brain, and if so, when and how? To delve into this topic, this paper reviews efforts in using computational models for brain research, highlighting emerging trends. To ensure a fair comparison, the paper evaluates various computational models using consistent metrics on the same dataset. Our analysis reveals that no single model outperforms others on all datasets, underscoring the need for rich testing datasets and rigid experimental control to draw robust conclusions in studies involving computational models.
- [1822] arXiv:2403.13369 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and PromptingPhillip Richter-Pechanski , Philipp Wiesenbach , Dominic M. Schwab , Christina Kiriakou , Nicolas Geis , Christoph Dieterich , Anette FrankSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working with low-resource.
- [1823] arXiv:2403.13372 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language ModelsComments: 12 pages, preprintSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It allows users to flexibly customize the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at this https URL and already received over 13,000 stars and 1,600 forks.
- [1824] arXiv:2403.13374 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Byzantine-resilient Federated Learning With Adaptivity to Data HeterogeneitySubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: This paper deals with federated learning (FL) in the presence of malicious Byzantine attacks and data heterogeneity. A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages the geometric median for aggregation and can freely select the round number for local updating. Different from most existing resilient approaches, which perform convergence analysis based on strongly-convex loss function or homogeneously distributed dataset, we conduct convergence analysis for not only strongly-convex but also non-convex loss function over heterogeneous dataset. According to our theoretical analysis, as long as the fraction of dataset from malicious users is less than half, RAGA can achieve convergence at rate $\mathcal{O}({1}/{T^{2/3- \delta}})$ where $T$ is the iteration number and $\delta \in (0, 2/3)$ for non-convex loss function, and at linear rate for strongly-convex loss function. Moreover, stationary point or global optimal solution is proved to obtainable as data heterogeneity vanishes. Experimental results corroborate the robustness of RAGA to Byzantine attacks and verifies the advantage of RAGA over baselines on convergence performance under various intensity of Byzantine attacks, for heterogeneous dataset.
- [1825] arXiv:2403.13405 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: DOR3D-Net: Dense Ordinal Regression Network for 3D Hand Pose EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Depth-based 3D hand pose estimation is an important but challenging research task in human-machine interaction community. Recently, dense regression methods have attracted increasing attention in 3D hand pose estimation task, which provide a low computational burden and high accuracy regression way by densely regressing hand joint offset maps. However, large-scale regression offset values are often affected by noise and outliers, leading to a significant drop in accuracy. To tackle this, we re-formulate 3D hand pose estimation as a dense ordinal regression problem and propose a novel Dense Ordinal Regression 3D Pose Network (DOR3D-Net). Specifically, we first decompose offset value regression into sub-tasks of binary classifications with ordinal constraints. Then, each binary classifier can predict the probability of a binary spatial relationship relative to joint, which is easier to train and yield much lower level of noise. The estimated hand joint positions are inferred by aggregating the ordinal regression results at local positions with a weighted sum. Furthermore, both joint regression loss and ordinal regression loss are used to train our DOR3D-Net in an end-to-end manner. Extensive experiments on public datasets (ICVL, MSRA, NYU and HANDS2017) show that our design provides significant improvements over SOTA methods.
- [1826] arXiv:2403.13408 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: S2DM: Sector-Shaped Diffusion Models for Video GenerationComments: 17 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. S2DM can generate a group of intrinsically related data sharing the same semantic and stochastic features while varying on temporal features with appropriate guided conditions. We apply S2DM to video generation tasks, and explore the use of optical flow as temporal conditions. Our experimental results show that S2DM outperforms many existing methods in the task of video generation without any temporal-feature modelling modules. For text-to-video generation tasks where temporal conditions are not explicitly given, we propose a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works. Our results can be viewd at this https URL .
- [1827] arXiv:2403.13421 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Caching-Augmented Lifelong Multi-Agent Path FindingSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract: Multi-Agent Path Finding (MAPF), which involves finding collision-free paths for multiple robots, is crucial in various applications. Lifelong MAPF, where targets are reassigned to agents as soon as they complete their initial targets, offers a more accurate approximation of real-world warehouse planning. In this paper, we present a novel mechanism named Caching-Augmented Lifelong MAPF (CAL-MAPF), designed to improve the performance of Lifelong MAPF. We have developed a new type of map grid called cache for temporary item storage and replacement, and created a locking mechanism to improve the planning solution's stability. A task assigner (TA) is designed for CAL-MAPF to allocate target locations to agents and control agent status in different situations. CAL-MAPF has been evaluated using various cache replacement policies and input task distributions. We have identified three main factors significantly impacting CAL-MAPF performance through experimentation: suitable input task distribution, high cache hit rate, and smooth traffic. In general, CAL-MAPF has demonstrated potential for performance improvements in certain task distributions, map and agent configurations.
- [1828] arXiv:2403.13479 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Deepfake Detection without Deepfakes: Generalization via Synthetic Frequency Patterns InjectionDavide Alessandro Coccomini , Roberto Caldelli , Claudio Gennaro , Giuseppe Fiameni , Giuseppe Amato , Fabrizio FalchiSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Deepfake detectors are typically trained on large sets of pristine and generated images, resulting in limited generalization capacity; they excel at identifying deepfakes created through methods encountered during training but struggle with those generated by unknown techniques. This paper introduces a learning approach aimed at significantly enhancing the generalization capabilities of deepfake detectors. Our method takes inspiration from the unique "fingerprints" that image generation processes consistently introduce into the frequency domain. These fingerprints manifest as structured and distinctly recognizable frequency patterns. We propose to train detectors using only pristine images injecting in part of them crafted frequency patterns, simulating the effects of various deepfake generation techniques without being specific to any. These synthetic patterns are based on generic shapes, grids, or auras. We evaluated our approach using diverse architectures across 25 different generation methods. The models trained with our approach were able to perform state-of-the-art deepfake detection, demonstrating also superior generalization capabilities in comparison with previous methods. Indeed, they are untied to any specific generation technique and can effectively identify deepfakes regardless of how they were made.
- [1829] arXiv:2403.13501 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: VSTAR: Generative Temporal Nursing for Longer Dynamic Video SynthesisComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract: Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), where we aim to alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP) - automatic generation of a video synopsis based on the original single prompt leveraging LLMs, which gives accurate textual guidance to different visual states of longer videos, and 2) Temporal Attention Regularization (TAR) - a regularization technique to refine the temporal attention units of the pre-trained T2V diffusion models, which enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach in generating longer, visually appealing videos over existing open-sourced T2V models. We additionally analyze the temporal attention maps realized with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
- [1830] arXiv:2403.13512 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Scale Decoupled DistillationComments: Accepted to CVPR2024 10 pages 6figureSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Logit knowledge distillation attracts increasing attention due to its practicality in recent studies. However, it often suffers inferior performance compared to the feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal since they only leverage the global logit output that couples multiple semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method, i.e., Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student to mine and inherit fine-grained and unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge that transfers the semantic information and sample ambiguity, respectively. By increasing the weight of complementary parts, SDD can guide the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for wide teacher-student pairs, especially in the fine-grained classification task. Code is available at: this https URL
- [1831] arXiv:2403.13513 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal ModelsComments: under review, code available: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: This paper presents a way of enhancing the reliability of Large Multimodal Models (LMMs) in addressing hallucination effects, where models generate incorrect or unrelated responses. Without additional instruction tuning paradigm, we introduce Counterfactual Inception, a novel method that implants counterfactual thoughts into LMMs using carefully chosen, misaligned counterfactual keywords. This method is grounded in the concept of counterfactual thinking, a cognitive process where humans consider alternative realities and outcomes. By applying this human-like reasoning mechanism to LMMs, we aim to reduce hallucination effects and improve the models' trustworthiness. We also propose Dual-modality Verification Process (DVP), a rigorous framework for selecting optimal counterfactual keywords to trigger counterfactual thinking into LMMs, concurrently considering visual and linguistic context. Our extensive experiments across various LMMs, including both open-source and proprietary models, corroborate that our method significantly mitigates hallucination phenomena across different datasets.
- [1832] arXiv:2403.13523 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Have You Poisoned My Data? Defending Neural Networks against Data PoisoningComments: Paper accepted for publication at European Symposium on Research in Computer Security (ESORICS) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: The unprecedented availability of training data fueled the rapid development of powerful neural networks in recent years. However, the need for such large amounts of data leads to potential threats such as poisoning attacks: adversarial manipulations of the training data aimed at compromising the learned model to achieve a given adversarial goal.
This paper investigates defenses against clean-label poisoning attacks and proposes a novel approach to detect and filter poisoned datapoints in the transfer learning setting. We define a new characteristic vector representation of datapoints and show that it effectively captures the intrinsic properties of the data distribution. Through experimental analysis, we demonstrate that effective poisons can be successfully differentiated from clean points in the characteristic vector space. We thoroughly evaluate our proposed approach and compare it to existing state-of-the-art defenses using multiple architectures, datasets, and poison budgets. Our evaluation shows that our proposal outperforms existing approaches in defense rate and final trained model performance across all experimental settings. - [1833] arXiv:2403.13524 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Compress3D: a Compressed Latent Space for 3D Generation from a Single ImageSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: 3D generation has witnessed significant advancements, yet efficiently producing high-quality 3D assets from a single image remains challenging. In this paper, we present a triplane autoencoder, which encodes 3D models into a compact triplane latent space to effectively compress both the 3D geometry and texture information. Within the autoencoder framework, we introduce a 3D-aware cross-attention mechanism, which utilizes low-resolution latent representations to query features from a high-resolution 3D feature volume, thereby enhancing the representation capacity of the latent space. Subsequently, we train a diffusion model on this refined latent space. In contrast to solely relying on image embedding for 3D generation, our proposed method advocates for the simultaneous utilization of both image embedding and shape embedding as conditions. Specifically, the shape embedding is estimated via a diffusion prior model conditioned on the image embedding. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art algorithms, achieving superior performance while requiring less training data and time. Our approach enables the generation of high-quality 3D assets in merely 7 seconds on a single A100 GPU.
- [1834] arXiv:2403.13537 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: What explains the success of cross-modal fine-tuning with ORCA?Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contribute to ORCA's success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
- [1835] arXiv:2403.13553 (cross-list from cs.HC) [ pdf , ps , other ]
-
Title: VCounselor: A Psychological Intervention Chat Agent Based on a Knowledge-Enhanced Large Language ModelComments: 24 pages, 6 figuresSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Conversational artificial intelligence can already independently engage in brief conversations with clients with psychological problems and provide evidence-based psychological interventions. The main objective of this study is to improve the effectiveness and credibility of the large language model in psychological intervention by creating a specialized agent, the VCounselor, to address the limitations observed in popular large language models such as ChatGPT in domain applications. We achieved this goal by proposing a new affective interaction structure and knowledge-enhancement structure. In order to evaluate VCounselor, this study compared the general large language model, the fine-tuned large language model, and VCounselor's knowledge-enhanced large language model. At the same time, the general large language model and the fine-tuned large language model will also be provided with an avatar to compare them as an agent with VCounselor. The comparison results indicated that the affective interaction structure and knowledge-enhancement structure of VCounselor significantly improved the effectiveness and credibility of the psychological intervention, and VCounselor significantly provided positive tendencies for clients' emotions. The conclusion of this study strongly supports that VConselor has a significant advantage in providing psychological support to clients by being able to analyze the patient's problems with relative accuracy and provide professional-level advice that enhances support for clients.
- [1836] arXiv:2403.13556 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban EnvironmentsSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high costs associated with annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them into either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigorous priors, leading to biases towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal \textsc{Find n' Propagate} approach for 3D OV tasks, aimed at maximizing the recall of novel objects and propagating this detection capability to more distant areas thereby progressively capturing more. In particular, we utilize a greedy box seeker to search against 3D novel boxes of varying orientations and depth in each generated frustum and ensure the reliability of newly identified boxes by cross alignment and density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances in the self-training process, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available in the supplementary material.
- [1837] arXiv:2403.13574 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: A Large Language Model Enhanced Sequential Recommender for Joint Video and Comment RecommendationBowen Zheng , Zihan Lin , Enze Liu , Chen Yang , Enyang Bai , Cheng Ling , Wayne Xin Zhao , Ji-Rong WenSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI)
Abstract: In online video platforms, reading or writing comments on interesting videos has become an essential part of the video watching experience. However, existing video recommender systems mainly model users' interaction behaviors with videos, lacking consideration of comments in user behavior modeling. In this paper, we propose a novel recommendation approach called LSVCR by leveraging user interaction histories with both videos and comments, so as to jointly conduct personalized video and comment recommendation. Specifically, our approach consists of two key components, namely sequential recommendation (SR) model and supplemental large language model (LLM) recommender. The SR model serves as the primary recommendation backbone (retained in deployment) of our approach, allowing for efficient user preference modeling. Meanwhile, we leverage the LLM recommender as a supplemental component (discarded in deployment) to better capture underlying user preferences from heterogeneous interaction behaviors. In order to integrate the merits of the SR model and the supplemental LLM recommender, we design a twostage training paradigm. The first stage is personalized preference alignment, which aims to align the preference representations from both components, thereby enhancing the semantics of the SR model. The second stage is recommendation-oriented fine-tuning, in which the alignment-enhanced SR model is fine-tuned according to specific objectives. Extensive experiments in both video and comment recommendation tasks demonstrate the effectiveness of LSVCR. Additionally, online A/B testing on the KuaiShou platform verifies the actual benefits brought by our approach. In particular, we achieve a significant overall gain of 4.13% in comment watch time.
- [1838] arXiv:2403.13597 (cross-list from cs.DB) [ pdf , ps , html , other ]
-
Title: No more optimization rules: LLM-enabled policy-based multi-modal query optimizerComments: Yifan and Haodi contribute equally to the workSubjects: Databases (cs.DB) ; Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract: Large language model (LLM) has marked a pivotal moment in the field of machine learning and deep learning. Recently its capability for query planning has been investigated, including both single-modal and multi-modal queries. However, there is no work on the query optimization capability of LLM. As a critical (or could even be the most important) step that significantly impacts the execution performance of the query plan, such analysis and attempts should not be missed. From another aspect, existing query optimizers are usually rule-based or rule-based + cost-based, i.e., they are dependent on manually created rules to complete the query plan rewrite/transformation. Given the fact that modern optimizers include hundreds to thousands of rules, designing a multi-modal query optimizer following a similar way is significantly time-consuming since we will have to enumerate as many multi-modal optimization rules as possible, which has not been well addressed today. In this paper, we investigate the query optimization ability of LLM and use LLM to design LaPuda, a novel LLM and Policy based multi-modal query optimizer. Instead of enumerating specific and detailed rules, LaPuda only needs a few abstract policies to guide LLM in the optimization, by which much time and human effort are saved. Furthermore, to prevent LLM from making mistakes or negative optimization, we borrow the idea of gradient descent and propose a guided cost descent (GCD) algorithm to perform the optimization, such that the optimization can be kept in the correct direction. In our evaluation, our methods consistently outperform the baselines in most cases. For example, the optimized plans generated by our methods result in 1~3x higher execution speed than those by the baselines.
- [1839] arXiv:2403.13619 (cross-list from cs.DC) [ pdf , ps , other ]
-
Title: Dynamic Resource Allocation for Virtual Machine Migration Optimization using Machine LearningSubjects: Distributed, Parallel, and Cluster Computing (cs.DC) ; Artificial Intelligence (cs.AI)
Abstract: The paragraph is grammatically correct and logically coherent. It discusses the importance of mobile terminal cloud computing migration technology in meeting the demands of evolving computer and cloud computing technologies. It emphasizes the need for efficient data access and storage, as well as the utilization of cloud computing migration technology to prevent additional time delays. The paragraph also highlights the contributions of cloud computing migration technology to expanding cloud computing services. Additionally, it acknowledges the role of virtualization as a fundamental capability of cloud computing while emphasizing that cloud computing and virtualization are not inherently interconnected. Finally, it introduces machine learning-based virtual machine migration optimization and dynamic resource allocation as a critical research direction in cloud computing, citing the limitations of static rules or manual settings in traditional cloud computing environments. Overall, the paragraph effectively communicates the importance of machine learning technology in addressing resource allocation and virtual machine migration challenges in cloud computing.
- [1840] arXiv:2403.13653 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning User Embeddings from Human Gaze for Personalised Saliency PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.
- [1841] arXiv:2403.13681 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case DocumentsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: In this paper, we present PARAMANU-AYN, a language model based exclusively on case documents of the Supreme Court of India, the Constitution of India, and the Indian Penal Code. The novel Auto Regressive (AR) decoder based model is pretrained from scratch at a context size of 8192. We evaluated our pretrained legal model on perplexity metrics. We also instruction-tuned our pretrained model on a set of 10,763 instructions covering various legal tasks such as legal reasoning, judgement explanation, legal clause generation, legal drafting, legal contract drafting, case summarization, constitutional question-answering, etc. We also evaluated the responses of prompts for instruction-tuned models by GPT-3.5-Turbo on clarity, relevance, completeness, and legal reasoning metrics in a scale of 10. Our model can be run on CPU and achieved 42.46 tokens/sec CPU inference speed. We found that our models, despite not being pretrained on legal books, various legal contracts, and legal documents, were able to learn the domain knowledge required for drafting various legal contracts and legal clauses, and generalize to draft legal contracts and legal clauses with limited instruction tuning. Hence, we conclude that for a strong domain-specialized generative language model (such as legal), very large amounts of data are not required to develop models from scratch. We believe that this work is the first attempt to make a dedicated generative legal language model from scratch for Indian Supreme Court jurisdiction or in legal NLP overall. We plan to release our Paramanu-Ayn model at this https URL .
- [1842] arXiv:2403.13682 (cross-list from cs.CR) [ pdf , ps , html , other ]
-
Title: Threats, Attacks, and Defenses in Machine Unlearning: A SurveySubjects: Cryptography and Security (cs.CR) ; Artificial Intelligence (cs.AI)
Abstract: Machine Unlearning (MU) has gained considerable attention recently for its potential to achieve Safe AI by removing the influence of specific data from trained machine learning models. This process, known as knowledge removal, addresses AI governance concerns of training data such as quality, sensitivity, copyright restrictions, and obsolescence. This capability is also crucial for ensuring compliance with privacy regulations such as the Right To Be Forgotten. Furthermore, effective knowledge removal mitigates the risk of harmful outcomes, safeguarding against biases, misinformation, and unauthorized data exploitation, thereby enhancing the safe and responsible use of AI systems. Efforts have been made to design efficient unlearning approaches, with MU services being examined for integration with existing machine learning as a service, allowing users to submit requests to remove specific data from the training corpus. However, recent research highlights vulnerabilities in machine unlearning systems, such as information leakage and malicious unlearning requests, that can lead to significant security and privacy concerns. Moreover, extensive research indicates that unlearning methods and prevalent attacks fulfill diverse roles within MU systems. For instance, unlearning can act as a mechanism to recover models from backdoor attacks, while backdoor attacks themselves can serve as an evaluation metric for unlearning effectiveness. This underscores the intricate relationship and complex interplay among these mechanisms in maintaining system functionality and safety. This survey aims to fill the gap between the extensive number of studies on threats, attacks, and defenses in machine unlearning and the absence of a comprehensive review that categorizes their taxonomy, methods, and solutions, thus offering valuable insights for future research directions and practical implementations.
- [1843] arXiv:2403.13684 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt TuningComments: Accepted as a conference paper at ICLR 2024; Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generalized Category Discovery (GCD) aims to classify unlabelled images from both `seen' and `unseen' classes by transferring knowledge from a set of labelled `seen' class images. A key theme in existing GCD approaches is adapting large-scale pre-trained models for the GCD task. An alternate perspective, however, is to adapt the data representation itself for better alignment with the pre-trained model. As such, in this paper, we introduce a two-stage adaptation approach termed SPTNet, which iteratively optimizes model parameters (i.e., model-finetuning) and data parameters (i.e., prompt learning). Furthermore, we propose a novel spatial prompt tuning method (SPT) which considers the spatial property of image data, enabling the method to better focus on object parts, which can transfer between seen and unseen classes. We thoroughly evaluate our SPTNet on standard benchmarks and demonstrate that our method outperforms existing GCD methods. Notably, we find our method achieves an average accuracy of 61.4% on the SSB, surpassing prior state-of-the-art methods by approximately 10%. The improvement is particularly remarkable as our method yields extra parameters amounting to only 0.117% of those in the backbone architecture. Project page: this https URL .
- [1844] arXiv:2403.13703 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Fostc3net:A Lightweight YOLOv5 Based On the Network Structure OptimizationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Transmission line detection technology is crucial for automatic monitoring and ensuring the safety of electrical facilities. The YOLOv5 series is currently one of the most advanced and widely used methods for object detection. However, it faces inherent challenges, such as high computational load on devices and insufficient detection accuracy. To address these concerns, this paper presents an enhanced lightweight YOLOv5 technique customized for mobile devices, specifically intended for identifying objects associated with transmission lines. The C3Ghost module is integrated into the convolutional network of YOLOv5 to reduce floating point operations per second (FLOPs) in the feature channel fusion process and improve feature expression performance. In addition, a FasterNet module is introduced to replace the c3 module in the YOLOv5 Backbone. The FasterNet module uses Partial Convolutions to process only a portion of the input channels, improving feature extraction efficiency and reducing computational overhead. To address the imbalance between simple and challenging samples in the dataset and the diversity of aspect ratios of bounding boxes, the wIoU v3 LOSS is adopted as the loss function. To validate the performance of the proposed approach, Experiments are conducted on a custom dataset of transmission line poles. The results show that the proposed model achieves a 1% increase in detection accuracy, a 13% reduction in FLOPs, and a 26% decrease in model parameters compared to the existing this http URL the ablation experiment, it was also discovered that while the Fastnet module and the CSghost module improved the precision of the original YOLOv5 baseline model, they caused a decrease in the mAP@.5-.95 metric. However, the improvement of the wIoUv3 loss function significantly mitigated the decline of the mAP@.5-.95 metric.
- [1845] arXiv:2403.13721 (cross-list from cs.NI) [ pdf , ps , html , other ]
-
Title: Large Language Models meet Network Slicing Management and OrchestrationAbdulhalim Dandoush (1 and 2), Viswanath Kumarskandpriya (1), Mueen Uddin (2), Usman Khalil (3) ((1) Esme Research Lab, SA ESME, Ivry-Sur-Seine, France, (2) University of Doha for Science and Technology (UDST), Doha, Qatar, (3) University Brunei Darussalam, Brunei Darrussalam)Subjects: Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
Abstract: Network slicing, a cornerstone technology for future networks, enables the creation of customized virtual networks on a shared physical infrastructure. This fosters innovation and agility by providing dedicated resources tailored to specific applications. However, current orchestration and management approaches face limitations in handling the complexity of new service demands within multi-administrative domain environments. This paper proposes a future vision for network slicing powered by Large Language Models (LLMs) and multi-agent systems, offering a framework that can be integrated with existing Management and Orchestration (MANO) frameworks. This framework leverages LLMs to translate user intent into technical requirements, map network functions to infrastructure, and manage the entire slice lifecycle, while multi-agent systems facilitate collaboration across different administrative domains. We also discuss the challenges associated with implementing this framework and potential solutions to mitigate them.
- [1846] arXiv:2403.13728 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape SchedulingXudong Sun , Nutan Chen , Alexej Gossmann , Yu Xing , Carla Feistner , Emilio Dorigatt , Felix Drost , Daniele Scarcella , Lisa Beer , Carsten MarrSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: We address the online combinatorial choice of weight multipliers for multi-objective optimization of many loss terms parameterized by neural works via a probabilistic graphical model (PGM) for the joint model parameter and multiplier evolution process, with a hypervolume based likelihood promoting multi-objective descent. The corresponding parameter and multiplier estimation as a sequential decision process is then cast into an optimal control problem, where the multi-objective descent goal is dispatched hierarchically into a series of constraint optimization sub-problems. The subproblem constraint automatically adapts itself according to Pareto dominance and serves as the setpoint for the low level multiplier controller to schedule loss landscapes via output feedback of each loss term. Our method is multiplier-free and operates at the timescale of epochs, thus saves tremendous computational resources compared to full training cycle multiplier tuning. It also circumvents the excessive memory requirements and heavy computational burden of existing multi-objective deep learning methods. We applied it to domain invariant variational auto-encoding with 6 loss terms on the PACS domain generalization task, and observed robust performance across a range of controller hyperparameters, as well as different multiplier initial conditions, outperforming other multiplier scheduling methods. We offered modular implementation of our method, admitting extension to custom definition of many loss terms.
- [1847] arXiv:2403.13729 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning for Online Testing of Autonomous Driving Systems: a Replication and Extension StudySubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Abstract: In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting or useless feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) which requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random testing. Results also highlight other possible improvements, which open to further investigations on how to best leverage RL for online ADS testing.
- [1848] arXiv:2403.13731 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Emotion Recognition Using Transformers with Masked LearningSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: In recent years, deep learning has achieved innovative advancements in various fields, including the analysis of human emotions and behaviors. Initiatives such as the Affective Behavior Analysis in-the-wild (ABAW) competition have been particularly instrumental in driving research in this area by providing diverse and challenging datasets that enable precise evaluation of complex emotional states. This study leverages the Vision Transformer (ViT) and Transformer models to focus on the estimation of Valence-Arousal (VA), which signifies the positivity and intensity of emotions, recognition of various facial expressions, and detection of Action Units (AU) representing fundamental muscle movements. This approach transcends traditional Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) based methods, proposing a new Transformer-based framework that maximizes the understanding of temporal and spatial features. The core contributions of this research include the introduction of a learning technique through random frame masking and the application of Focal loss adapted for imbalanced data, enhancing the accuracy and applicability of emotion and behavior analysis in real-world settings. This approach is expected to contribute to the advancement of emotional computing and deep learning methodologies.
- [1849] arXiv:2403.13741 (cross-list from cs.MA) [ pdf , ps , other ]
-
Title: Hyper Strategy LogicComments: AAMAS 2024Subjects: Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Abstract: Strategy logic (SL) is a powerful temporal logic that enables strategic reasoning in multi-agent systems. SL supports explicit (first-order) quantification over strategies and provides a logical framework to express many important properties such as Nash equilibria, dominant strategies, etc. While in SL the same strategy can be used in multiple strategy profiles, each such profile is evaluated w.r.t. a path-property, i.e., a property that considers the single path resulting from a particular strategic interaction. In this paper, we present Hyper Strategy Logic (HyperSL), a strategy logic where the outcome of multiple strategy profiles can be compared w.r.t. a hyperproperty, i.e., a property that relates multiple paths. We show that HyperSL can capture important properties that cannot be expressed in SL, including non-interference, quantitative Nash equilibria, optimal adversarial planning, and reasoning under imperfect information. On the algorithmic side, we identify an expressive fragment of HyperSL with decidable model checking and present a model-checking algorithm. We contribute a prototype implementation of our algorithm and report on encouraging experimental results.
- [1850] arXiv:2403.13765 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Towards Principled Representation Learning from Videos for Reinforcement LearningComments: ICLR 2024 Spotlight Conference PaperSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Even though significant empirical advances have been made on this problem, a theoretical understanding remains absent. We initiate the theoretical investigation into principled approaches for representation learning and focus on learning the latent state representations of the underlying MDP using video data. We study two types of settings: one where there is iid noise in the observation, and a more challenging setting where there is also the presence of exogenous noise, which is non-iid noise that is temporally correlated, such as the motion of people or cars in the background. We study three commonly used approaches: autoencoding, temporal contrastive learning, and forward modeling. We prove upper bounds for temporal contrastive learning and forward modeling in the presence of only iid noise. We show that these approaches can learn the latent state and use it to do efficient downstream RL with polynomial sample complexity. When exogenous noise is also present, we establish a lower bound result showing that the sample complexity of learning from video data can be exponentially worse than learning from action-labeled trajectory data. This partially explains why reinforcement learning with video pre-training is hard. We evaluate these representational learning methods in two visual domains, yielding results that are consistent with our theoretical findings.
- [1851] arXiv:2403.13780 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Information-Theoretic Distillation for Reference-less SummarizationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The current winning recipe for automatic summarization is using proprietary large-scale language models (LLMs) such as ChatGPT as is, or imitation learning from them as teacher models. While increasingly ubiquitous dependence on such large-scale language models is convenient, there remains an important question of whether small-scale models could have achieved competitive results, if we were to seek an alternative learning method -- that allows for a more cost-efficient, controllable, yet powerful summarizer. We present InfoSumm, a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization, without relying on either the LLM's capability or human-written references. To achieve this, we first propose a novel formulation of the desiderata of summarization (saliency, faithfulness and brevity) through the lens of mutual information between the original document and the summary. Based on this formulation, we start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization, then self-train the model to optimize for the information-centric measures of ideal summaries. Distilling from the improved teacher, we arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT, without ever relying on ChatGPT's capabilities. Extensive analysis demonstrates that our approach outperforms in-domain supervised models in human evaluation, let alone state-of-the-art unsupervised methods, and wins over ChatGPT in controllable summarization.
- [1852] arXiv:2403.13784 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AIMatt White , Ibrahim Haddad , Cailean Osborne , Xiao-Yang (Yanglet)Liu, Ahmed Abdelmonsef , Sachin VargheseComments: 45 pagesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Abstract: Generative AI (GAI) offers unprecedented possibilities but its commercialization has raised concerns about transparency, reproducibility, bias, and safety. Many "open-source" GAI models lack the necessary components for full understanding and reproduction, and some use restrictive licenses, a practice known as "openwashing." We propose the Model Openness Framework (MOF), a ranked classification system that rates machine learning models based on their completeness and openness, following principles of open science, open source, open data, and open access. The MOF requires specific components of the model development lifecycle to be included and released under appropriate open licenses. This framework aims to prevent misrepresentation of models claiming to be open, guide researchers and developers in providing all model components under permissive licenses, and help companies, academia, and hobbyists identify models that can be safely adopted without restrictions. Wide adoption of the MOF will foster a more open AI ecosystem, accelerating research, innovation, and adoption.
- [1853] arXiv:2403.13798 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Hierarchical NeuroSymbolic Approach for Action Quality AssessmentSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Abstract: Action quality assessment (AQA) applies computer vision to quantitatively assess the performance or execution of a human action. Current AQA approaches are end-to-end neural models, which lack transparency and tend to be biased because they are trained on subjective human judgements as ground-truth. To address these issues, we introduce a neuro-symbolic paradigm for AQA, which uses neural networks to abstract interpretable symbols from video data and makes quality assessments by applying rules to those symbols. We take diving as the case study. We found that domain experts prefer our system and find it more informative than purely neural approaches to AQA in diving. Our system also achieves state-of-the-art action recognition and temporal segmentation, and automatically generates a detailed report that breaks the dive down into its elements and provides objective scoring with visual evidence. As verified by a group of domain experts, this report may be used to assist judges in scoring, help train judges, and provide feedback to divers. We will open-source all of our annotated training data and code for ease of reproducibility.
- [1854] arXiv:2403.13799 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Reverse Training to Nurse the Reversal CurseSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even when training with trillions of tokens this issue still appears due to Zipf's law - hence even if we train on the entire internet. This work proposes an alternative training scheme, called reverse training, whereby all words are used twice, doubling the amount of available tokens. The LLM is trained in both forward and reverse directions by reversing the training strings while preserving (i.e., not reversing) chosen substrings, such as entities. We show that data-matched reverse-trained models provide superior performance to standard models on standard tasks, and compute-matched reverse-trained models provide far superior performance on reversal tasks, helping resolve the reversal curse issue.
- [1855] arXiv:2403.13801 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMsComments: 8 pages, 2 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We demonstrate experimental results with LLMs that address robotics task planning problems. Recently, LLMs have been applied in robotics task planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy codes. In contrast, our approach acquires text descriptions of the task and scene objects, then formulates task planning through natural language reasoning, and outputs coordinate level control commands, thus reducing the necessity for intermediate representation code as policies with pre-defined APIs. Our approach is evaluated on a multi-modal prompt simulation benchmark, demonstrating that our prompt engineering experiments with natural language reasoning significantly enhance success rates compared to its absence. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks. The project website: this https URL
- [1856] arXiv:2403.13802 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ZigMa: A DiT-style Zigzag Mamba Diffusion ModelVincent Tao Hu , Stefan Andreas Baumann , Ming Gui , Olga Grebenkova , Pingchuan Ma , Johannes Fischer , Björn OmmerComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$ . Code will be released at this https URL
- [1857] arXiv:2403.13805 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: RAR: Retrieving And Ranking Augmented MLLMs for Visual RecognitionZiyu Liu , Zeyi Sun , Yuhang Zang , Wei Li , Pan Zhang , Xiaoyi Dong , Yuanjun Xiong , Dahua Lin , Jiaqi WangComments: Project: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.
- [1858] arXiv:2403.13808 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: On Pretraining Data Diversity for Self-Supervised LearningComments: Under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. Code and trained models will be available at this https URL .
- [1859] arXiv:2403.13809 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Predicting Confinement Effect of Carbon Fiber Reinforced Polymers on Strength of Concrete using Metaheuristics-based Artificial Neural NetworksSarmed Wahab , Mohamed Suleiman , Faisal Shabbir , Nasim Shakouri Mahmoudabadi , Sarmad Waqas , Nouman Herl , Afaq AhmadComments: 28 Pages, 19 FiguresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: This article deals with the study of predicting the confinement effect of carbon fiber reinforced polymers (CFRPs) on concrete cylinder strength using metaheuristics-based artificial neural networks. A detailed database of 708 CFRP confined concrete cylinders is developed from previously published research with information on 8 parameters including geometrical parameters like the diameter (d) and height (h) of a cylinder, unconfined compressive strength of concrete (fco'), thickness (nt), the elastic modulus of CFRP (Ef), unconfined concrete strain confined concrete strain and the ultimate compressive strength of confined concrete fcc'. Three metaheuristic models are implemented including particle swarm optimization (PSO), grey wolf optimizer (GWO), and bat algorithm (BA). These algorithms are trained on the data using an objective function of mean square error and their predicted results are validated against the experimental studies and finite element analysis. The study shows that the hybrid model of PSO predicted the strength of CFRP-confined concrete cylinders with maximum accuracy of 99.13% and GWO predicted the results with an accuracy of 98.17%. The high accuracy of axial compressive strength predictions demonstrated that these prediction models are a reliable solution to the empirical methods. The prediction models are especially suitable for avoiding full-scale time-consuming experimental tests that make the process quick and economical.
- [1860] arXiv:2403.13812 (cross-list from cs.DL) [ pdf , ps , other ]
-
Title: Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection ToolComments: 8 pages, 6 figures, 1 tableSubjects: Digital Libraries (cs.DL) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Other Statistics (stat.OT)
Abstract: Many people are interested in ChatGPT since it has become a prominent AIGC model that provides high-quality responses in various contexts, such as software development and maintenance. Misuse of ChatGPT might cause significant issues, particularly in public safety and education, despite its immense potential. The majority of researchers choose to publish their work on Arxiv. The effectiveness and originality of future work depend on the ability to detect AI components in such contributions. To address this need, this study will analyze a method that can see purposely manufactured content that academic organizations use to post on Arxiv. For this study, a dataset was created using physics, mathematics, and computer science articles. Using the newly built dataset, the following step is to put this http URL through its paces. The statistical analysis shows that this http URL is very accurate, with a rate of 98%.
- [1861] arXiv:2403.13825 (cross-list from physics.ins-det) [ pdf , ps , other ]
-
Title: Deep Generative Models for Ultra-High Granularity Particle Physics Detector Simulation: A Voyage From Emulation to ExtrapolationComments: PhD thesis, 234 pagesSubjects: Instrumentation and Detectors (physics.ins-det) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Abstract: Simulating ultra-high-granularity detector responses in Particle Physics represents a critical yet computationally demanding task. This thesis aims to overcome this challenge for the Pixel Vertex Detector (PXD) at the Belle II experiment, which features over 7.5M pixel channels-the highest spatial resolution detector simulation dataset ever analysed with generative models. This thesis starts off by a comprehensive and taxonomic review on generative models for simulating detector signatures. Then, it presents the Intra-Event Aware Generative Adversarial Network (IEA-GAN), a new geometry-aware generative model that introduces a relational attentive reasoning and Self-Supervised Learning to approximate an "event" in the detector. This study underscores the importance of intra-event correlation for downstream physics analyses. Building upon this, the work drifts towards a more generic approach and presents YonedaVAE, a Category Theory-inspired generative model that tackles the open problem of Out-of-Distribution (OOD) simulation. YonedaVAE introduces a learnable Yoneda embedding to capture the entirety of an event based on its sensor relationships, formulating a Category theoretical language for intra-event relational reasoning. This is complemented by introducing a Self-Supervised learnable prior for VAEs and an Adaptive Top-q sampling mechanism, enabling the model to sample point clouds with variable intra-category cardinality in a zero-shot manner. Variable Intra-event cardinality has not been approached before and is vital for simulating irregular detector geometries. Trained on an early experiment data, YonedaVAE can reach a reasonable OOD simulation precision of a later experiment with almost double luminosity. This study introduces, for the first time, the results of using deep generative models for ultra-high granularity detector simulation in Particle Physics.
- [1862] arXiv:2403.13835 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing FeesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Abstract: The advancement of Large Language Models (LLMs) has significantly boosted performance in natural language processing (NLP) tasks. However, the deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance. This has made the use of state-of-the-art LLMs more expensive for end-users. AI service providers, such as OpenAI and Anthropic, often offer multiple versions of LLMs with varying prices and performance. However, end-users still face challenges in choosing the appropriate LLM for their tasks that balance result quality with cost.
We introduce SMART, Scaling Models Adaptively for Reduced Token Fees, a novel LLM framework designed to minimize the inference costs of NLP tasks while ensuring sufficient result quality. It enables users to specify an accuracy constraint in terms of the equivalence of outputs to those of the most powerful LLM. SMART then generates results that deviate from the outputs of this LLM only with a probability below a user-defined threshold. SMART employs a profiling phase that evaluates the performance of multiple LLMs to identify those that meet the user-defined accuracy level. SMART optimizes the tradeoff between profiling overheads and the anticipated cost savings resulting from profiling. Moreover, our approach significantly reduces inference costs by strategically leveraging a mix of LLMs. Our experiments on three real-world datasets show that, based on OpenAI models, SMART achieves significant cost savings, up to 25.6x in comparison to GPT-4. - [1863] arXiv:2403.13839 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning ResearchersComments: 16 pages, 2 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Abstract: PyTorch \texttt{2.x} introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, adapting to the PyTorch compiler to full potential can be challenging. The compiler operates at the Python bytecode level, making it appear as an opaque box. To address this, we introduce \texttt{depyf}, a tool designed to demystify the inner workings of the PyTorch compiler. \texttt{depyf} decompiles bytecode generated by PyTorch back into equivalent source code, and establishes connections between in-memory code objects and their on-disk source code counterparts. This feature enables users to step through the source code line by line using debuggers, thus enhancing their understanding of the underlying processes. Notably, \texttt{depyf} is non-intrusive and user-friendly, primarily relying on two convenient context managers for its core functionality. The project is \href{ this https URL }{ openly available} and is recognized as a \href{ this https URL }{PyTorch ecosystem project}.
- [1864] arXiv:2403.13840 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Whose Side Are You On? Investigating the Political Stance of Large Language ModelsPagnarasmey Pit , Xingjun Ma , Mike Conway , Qingyu Chen , James Bailey , Henry Pit , Putrasmey Keo , Watey Diep , Yu-Gang JiangSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Abstract: Large Language Models (LLMs) have gained significant popularity for their application in various everyday tasks such as text generation, summarization, and information retrieval. As the widespread adoption of LLMs continues to surge, it becomes increasingly crucial to ensure that these models yield responses that are politically impartial, with the aim of preventing information bubbles, upholding fairness in representation, and mitigating confirmation bias. In this paper, we propose a quantitative framework and pipeline designed to systematically investigate the political orientation of LLMs. Our investigation delves into the political alignment of LLMs across a spectrum of eight polarizing topics, spanning from abortion to LGBTQ issues. Across topics, the results indicate that LLMs exhibit a tendency to provide responses that closely align with liberal or left-leaning perspectives rather than conservative or right-leaning ones when user queries include details pertaining to occupation, race, or political affiliation. The findings presented in this study not only reaffirm earlier observations regarding the left-leaning characteristics of LLMs but also surface particular attributes, such as occupation, that are particularly susceptible to such inclinations even when directly steered towards conservatism. As a recommendation to avoid these models providing politicised responses, users should be mindful when crafting queries, and exercise caution in selecting neutral prompt language.
- [1865] arXiv:2403.13841 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Integrating Wearable Sensor Data and Self-reported Diaries for Personalized Affect ForecastingZhongqi Yang , Yuning Wang , Ken S. Yamashita , Maryam Sabah , Elahe Khatibi , Iman Azimi , Nikil Dutt , Jessica L. Borelli , Amir M. RahmaniComments: Accepted by Connected Health: Applications, Systems and Engineering Technologies (CHASE) 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Emotional states, as indicators of affect, are pivotal to overall health, making their accurate prediction before onset crucial. Current studies are primarily centered on immediate short-term affect detection using data from wearable and mobile devices. These studies typically focus on objective sensory measures, often neglecting other forms of self-reported information like diaries and notes. In this paper, we propose a multimodal deep learning model for affect status forecasting. This model combines a transformer encoder with a pre-trained language model, facilitating the integrated analysis of objective metrics and self-reported diaries. To validate our model, we conduct a longitudinal study, enrolling college students and monitoring them over a year, to collect an extensive dataset including physiological, environmental, sleep, metabolic, and physical activity parameters, alongside open-ended textual diaries provided by the participants. Our results demonstrate that the proposed model achieves predictive accuracy of 82.50% for positive affect and 82.76% for negative affect, a full week in advance. The effectiveness of our model is further elevated by its explainability.
- [1866] arXiv:2403.13843 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Machine Learning and Vision Transformers for Thyroid Carcinoma Diagnosis: A reviewYassine Habchi , Hamza Kheddar , Yassine Himeur , Abdelkrim Boukabou , Ammar Chouchane , Abdelmalik Ouamane , Shadi Atalla , Wathiq MansoorSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: The growing interest in developing smart diagnostic systems to help medical experts process extensive data for treating incurable diseases has been notable. In particular, the challenge of identifying thyroid cancer (TC) has seen progress with the use of machine learning (ML) and big data analysis, incorporating transformers to evaluate TC prognosis and determine the risk of malignancy in individuals. This review article presents a summary of various studies on AIbased approaches, especially those employing transformers, for diagnosing TC. It introduces a new categorization system for these methods based on artifcial intelligence (AI) algorithms, the goals of the framework, and the computing environments used. Additionally, it scrutinizes and contrasts the available TC datasets by their features. The paper highlights the importance of AI instruments in aiding the diagnosis and treatment of TC through supervised, unsupervised, or mixed approaches, with a special focus on the ongoing importance of transformers in medical diagnostics and disease management. It further discusses the progress made and the continuing obstacles in this area. Lastly, it explores future directions and focuses within this research feld.
- [1867] arXiv:2403.13844 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Scheduled Knowledge Acquisition on Lightweight Vector Symbolic Architectures for Brain-Computer InterfacesComments: Accepted as a full paper by the tinyML Research Symposium 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Brain-Computer interfaces (BCIs) are typically designed to be lightweight and responsive in real-time to provide users timely feedback. Classical feature engineering is computationally efficient but has low accuracy, whereas the recent neural networks (DNNs) improve accuracy but are computationally expensive and incur high latency. As a promising alternative, the low-dimensional computing (LDC) classifier based on vector symbolic architecture (VSA), achieves small model size yet higher accuracy than classical feature engineering methods. However, its accuracy still lags behind that of modern DNNs, making it challenging to process complex brain signals. To improve the accuracy of a small model, knowledge distillation is a popular method. However, maintaining a constant level of distillation between the teacher and student models may not be the best way for a growing student during its progressive learning stages. In this work, we propose a simple scheduled knowledge distillation method based on curriculum data order to enable the student to gradually build knowledge from the teacher model, controlled by an $\alpha$ scheduler. Meanwhile, we employ the LDC/VSA as the student model to enhance the on-device inference efficiency for tiny BCI devices that demand low latency. The empirical results have demonstrated that our approach achieves better tradeoff between accuracy and hardware efficiency compared to other methods.
- [1868] arXiv:2403.13845 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Learning to better see the unseen: Broad-Deep Mixed Anti-Forgetting Framework for Incremental Zero-Shot Fault DiagnosisSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Zero-shot fault diagnosis (ZSFD) is capable of identifying unseen faults via predicting fault attributes labeled by human experts. We first recognize the demand of ZSFD to deal with continuous changes in industrial processes, i.e., the model's ability to adapt to new fault categories and attributes while avoiding forgetting the diagnosis ability learned previously. To overcome the issue that the existing ZSFD paradigm cannot learn from evolving streams of training data in industrial scenarios, the incremental ZSFD (IZSFD) paradigm is proposed for the first time, which incorporates category increment and attribute increment for both traditional ZSFD and generalized ZSFD paradigms. To achieve IZSFD, we present a broad-deep mixed anti-forgetting framework (BDMAFF) that aims to learn from new fault categories and attributes. To tackle the issue of forgetting, BDMAFF effectively accumulates previously acquired knowledge from two perspectives: features and attribute prototypes. The feature memory is established through a deep generative model that employs anti-forgetting training strategies, ensuring the generation quality of historical categories is supervised and maintained. The diagnosis model SEEs the UNSEEN faults with the help of generated samples from the generative model. The attribute prototype memory is established through a diagnosis model inspired by the broad learning system. Unlike traditional incremental learning algorithms, BDMAFF introduces a memory-driven iterative update strategy for the diagnosis model, which allows the model to learn new faults and attributes without requiring the storage of all historical training samples. The effectiveness of the proposed method is verified by a real hydraulic system and the Tennessee-Eastman benchmark process.
- [1869] arXiv:2403.13846 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Clustering Method with Graph Maximum Decoding InformationComments: 9 pages, 9 figures, IJCNN 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: The clustering method based on graph models has garnered increased attention for its widespread applicability across various knowledge domains. Its adaptability to integrate seamlessly with other relevant applications endows the graph model-based clustering analysis with the ability to robustly extract "natural associations" or "graph structures" within datasets, facilitating the modelling of relationships between data points. Despite its efficacy, the current clustering method utilizing the graph-based model overlooks the uncertainty associated with random walk access between nodes and the embedded structural information in the data. To address this gap, we present a novel Clustering method for Maximizing Decoding Information within graph-based models, named CMDI. CMDI innovatively incorporates two-dimensional structural information theory into the clustering process, consisting of two phases: graph structure extraction and graph vertex partitioning. Within CMDI, graph partitioning is reformulated as an abstract clustering problem, leveraging maximum decoding information to minimize uncertainty associated with random visits to vertices. Empirical evaluations on three real-world datasets demonstrate that CMDI outperforms classical baseline methods, exhibiting a superior decoding information ratio (DI-R). Furthermore, CMDI showcases heightened efficiency, particularly when considering prior knowledge (PK). These findings underscore the effectiveness of CMDI in enhancing decoding information quality and computational efficiency, positioning it as a valuable tool in graph-based clustering analyses.
- [1870] arXiv:2403.13847 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Optimal Transport for Domain Adaptation through Gaussian Mixture ModelsComments: 10 pages,5 figures,under reviewSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: In this paper we explore domain adaptation through optimal transport. We propose a novel approach, where we model the data distributions through Gaussian mixture models. This strategy allows us to solve continuous optimal transport through an equivalent discrete problem. The optimal transport solution gives us a matching between source and target domain mixture components. From this matching, we can map data points between domains, or transfer the labels from the source domain components towards the target domain. We experiment with 2 domain adaptation benchmarks in fault diagnosis, showing that our methods have state-of-the-art performance.
- [1871] arXiv:2403.13848 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Smooth Sensitivity for Learning Differentially-Private yet Accurate Rule ListsTimothée Ly (LAAS-ROC), Julien Ferry (EPM), Marie-José Huguet (LAAS-ROC), Sébastien Gambs (UQAM), Ulrich Aivodji (ETS)Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Differentially-private (DP) mechanisms can be embedded into the design of a machine learningalgorithm to protect the resulting model against privacy leakage, although this often comes with asignificant loss of accuracy. In this paper, we aim at improving this trade-off for rule lists modelsby establishing the smooth sensitivity of the Gini impurity and leveraging it to propose a DP greedyrule list algorithm. In particular, our theoretical analysis and experimental results demonstrate thatthe DP rule lists models integrating smooth sensitivity have higher accuracy that those using otherDP frameworks based on global sensitivity.
- [1872] arXiv:2403.13849 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Graphs Unveiled: Graph Neural Networks and Graph GenerationSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: One of the hot topics in machine learning is the field of GNN. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. This paper represents a survey, providing a comprehensive overview of Graph Neural Networks (GNNs). We discuss the applications of graph neural networks across various domains. Finally, we present an advanced field in GNNs: graph generation.
- [1873] arXiv:2403.13850 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Spatio-Temporal Fluid Dynamics Modeling via Physical-Awareness and Parameter Diffusion GuidanceHao Wu , Fan Xu , Yifan Duan , Ziwei Niu , Weiyan Wang , Gaofeng Lu , Kun Wang , Yuxuan Liang , Yang WangSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Abstract: This paper proposes a two-stage framework named ST-PAD for spatio-temporal fluid dynamics modeling in the field of earth sciences, aiming to achieve high-precision simulation and prediction of fluid dynamics through spatio-temporal physics awareness and parameter diffusion guidance. In the upstream stage, we design a vector quantization reconstruction module with temporal evolution characteristics, ensuring balanced and resilient parameter distribution by introducing general physical constraints. In the downstream stage, a diffusion probability network involving parameters is utilized to generate high-quality future states of fluids, while enhancing the model's generalization ability by perceiving parameters in various physical setups. Extensive experiments on multiple benchmark datasets have verified the effectiveness and robustness of the ST-PAD framework, which showcase that ST-PAD outperforms current mainstream models in fluid dynamics modeling and prediction, especially in effectively capturing local representations and maintaining significant advantages in OOD generations.
- [1874] arXiv:2403.13863 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic ModelComments: 26 pages, 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Databases (cs.DB)
Abstract: Tabular data plays a crucial role in various domains but often suffers from missing values, thereby curtailing its potential utility. Traditional imputation techniques frequently yield suboptimal results and impose substantial computational burdens, leading to inaccuracies in subsequent modeling tasks. To address these challenges, we propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM). Specifically, DiffImpute is trained on complete tabular datasets, ensuring that it can produce credible imputations for missing entries without undermining the authenticity of the existing data. Innovatively, it can be applied to various settings of Missing Completely At Random (MCAR) and Missing At Random (MAR). To effectively handle the tabular features in DDPM, we tailor four tabular denoising networks, spanning MLP, ResNet, Transformer, and U-Net. We also propose Harmonization to enhance coherence between observed and imputed data by infusing the data back and denoising them multiple times during the sampling stage. To enable efficient inference while maintaining imputation performance, we propose a refined non-Markovian sampling process that works along with Harmonization. Empirical evaluations on seven diverse datasets underscore the prowess of DiffImpute. Specifically, when paired with the Transformer as the denoising network, it consistently outperforms its competitors, boasting an average ranking of 1.7 and the most minimal standard deviation. In contrast, the next best method lags with a ranking of 2.8 and a standard deviation of 0.9. The code is available at this https URL .
- [1875] arXiv:2403.13866 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: The Bid Picture: Auction-Inspired Multi-player Generative Adversarial Networks TrainingSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This article proposes auction-inspired multi-player generative adversarial networks training, which mitigates the mode collapse problem of GANs. Mode collapse occurs when an over-fitted generator generates a limited range of samples, often concentrating on a small subset of the data distribution. Despite the restricted diversity of generated samples, the discriminator can still be deceived into distinguishing these samples as real samples from the actual distribution. In the absence of external standards, a model cannot recognize its failure during the training phase. We extend the two-player game of generative adversarial networks to the multi-player game. During the training, the values of each model are determined by the bids submitted by other players in an auction-like process.
- [1876] arXiv:2403.13869 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Accurately Predicting Probabilities of Safety-Critical Rare Events for Intelligent SystemsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Intelligent systems are increasingly integral to our daily lives, yet rare safety-critical events present significant latent threats to their practical deployment. Addressing this challenge hinges on accurately predicting the probability of safety-critical events occurring within a given time step from the current state, a metric we define as 'criticality'. The complexity of predicting criticality arises from the extreme data imbalance caused by rare events in high dimensional variables associated with the rare events, a challenge we refer to as the curse of rarity. Existing methods tend to be either overly conservative or prone to overlooking safety-critical events, thus struggling to achieve both high precision and recall rates, which severely limits their applicability. This study endeavors to develop a criticality prediction model that excels in both precision and recall rates for evaluating the criticality of safety-critical autonomous systems. We propose a multi-stage learning framework designed to progressively densify the dataset, mitigating the curse of rarity across stages. To validate our approach, we evaluate it in two cases: lunar lander and bipedal walker scenarios. The results demonstrate that our method surpasses traditional approaches, providing a more accurate and dependable assessment of criticality in intelligent systems.
- [1877] arXiv:2403.13890 (cross-list from eess.IV) [ pdf , ps , other ]
-
Title: Towards Learning Contrast Kinetics with Multi-Condition Latent Diffusion ModelsRichard Osuala , Daniel Lang , Preeti Verma , Smriti Joshi , Apostolia Tsirikoglou , Grzegorz Skorupko , Kaisar Kushibar , Lidia Garrucho , Walter H. L. Pinaya , Oliver Diaz , Julia Schnabel , Karim LekadirSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Contrast agents in dynamic contrast enhanced magnetic resonance imaging allow to localize tumors and observe their contrast kinetics, which is essential for cancer characterization and respective treatment decision-making. However, contrast agent administration is not only associated with adverse health risks, but also restricted for patients during pregnancy, and for those with kidney malfunction, or other adverse reactions. With contrast uptake as key biomarker for lesion malignancy, cancer recurrence risk, and treatment response, it becomes pivotal to reduce the dependency on intravenous contrast agent administration. To this end, we propose a multi-conditional latent diffusion model capable of acquisition time-conditioned image synthesis of DCE-MRI temporal sequences. To evaluate medical image synthesis, we additionally propose and validate the Fréchet radiomics distance as an image quality measure based on biomarker variability between synthetic and real imaging data. Our results demonstrate our method's ability to generate realistic multi-sequence fat-saturated breast DCE-MRI and uncover the emerging potential of deep learning based contrast kinetics simulation. We publicly share our accessible codebase at this https URL and provide a user-friendly library for Fréchet radiomics distance calculation at this https URL .
- [1878] arXiv:2403.13925 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reducing Large Language Model Bias with Emphasis on 'Restricted Industries': Automated Dataset Augmentation and Prejudice QuantificationSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Despite the growing capabilities of large language models, there exists concerns about the biases they develop. In this paper, we propose a novel, automated mechanism for debiasing through specified dataset augmentation in the lens of bias producers and in the context of 'restricted industries' with limited data. We additionally create two new additional metrics, the mb-index and db-index, to quantify bias, considering the idea that bias occurs due to both intrinsic model architecture and dataset.
- [1879] arXiv:2403.13940 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Multi-criteria approach for selecting an explanation from the set of counterfactuals produced by an ensemble of explainersComments: 17 pages, 2 figuresJournal-ref: International Journal of Applied Mathematics and Computer Science 34 1 (2024) 119-133Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Counterfactuals are widely used to explain ML model predictions by providing alternative scenarios for obtaining the more desired predictions. They can be generated by a variety of methods that optimize different, sometimes conflicting, quality measures and produce quite different solutions. However, choosing the most appropriate explanation method and one of the generated counterfactuals is not an easy task. Instead of forcing the user to test many different explanation methods and analysing conflicting solutions, in this paper, we propose to use a multi-stage ensemble approach that will select single counterfactual based on the multiple-criteria analysis. It offers a compromise solution that scores well on several popular quality measures. This approach exploits the dominance relation and the ideal point decision aid method, which selects one counterfactual from the Pareto front. The conducted experiments demonstrated that the proposed approach generates fully actionable counterfactuals with attractive compromise values of the considered quality measures.
- [1880] arXiv:2403.13947 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: BlendScape: Enabling Unified and Personalized Video-Conferencing Environments through Generative AIShwetha Rajaram , Nels Numan , Balasaravanan Thoravi Kumaravel , Nicolai Marquardt , Andrew D. WilsonSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Today's video-conferencing tools support a rich range of professional and social activities, but their generic, grid-based environments cannot be easily adapted to meet the varying needs of distributed collaborators. To enable end-user customization, we developed BlendScape, a system for meeting participants to compose video-conferencing environments tailored to their collaboration context by leveraging AI image generation techniques. BlendScape supports flexible representations of task spaces by blending users' physical or virtual backgrounds into unified environments and implements multimodal interaction techniques to steer the generation. Through an evaluation with 15 end-users, we investigated their customization preferences for work and social scenarios. Participants could rapidly express their design intentions with BlendScape and envisioned using the system to structure collaboration in future meetings, but experienced challenges with preventing distracting elements. We implement scenarios to demonstrate BlendScape's expressiveness in supporting distributed collaboration techniques from prior work and propose composition techniques to improve the quality of environments.
- [1881] arXiv:2403.13950 (cross-list from cs.NE) [ pdf , ps , other ]
-
Title: Evo* 2023 -- Late-Breaking Abstracts VolumeComments: LBAs accepted in Evo* 2023. Part of the Conference ProceedingsSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Volume with the Late-Breaking Abstracts submitted to the Evo* 2023 Conference, held in Brno (Czech Republic), from 12 to 14 of April. These papers present ongoing research and preliminary results investigating on the application of different approaches of Bioinspired Methods (mainly Evolutionary Computation) to different problems, most of them real world ones.
- [1882] arXiv:2403.13951 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-OnSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Virtual Try-on (VTON) involves generating images of a person wearing selected garments. Diffusion-based methods, in particular, can create high-quality images, but they struggle to maintain the identities of the input garments. We identified this problem stems from the specifics in the training formulation for diffusion. To address this, we propose a unique training scheme that limits the scope in which diffusion is trained. We use a control image that perfectly aligns with the target image during training. In turn, this accurately preserves garment details during inference. We demonstrate our method not only effectively conserves garment details but also allows for layering, styling, and shoe try-on. Our method runs multi-garment try-on in a single inference cycle and can support high-quality zoomed-in generations without training in higher resolutions. Finally, we show our method surpasses prior methods in accuracy and quality.
- [1883] arXiv:2403.13960 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robotComments: 7 pages, 3 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a new software framework for HRI experimentation with the sixth version of the common NAO robot produced by the United Robotics Group. Embracing the common demand of researchers for better performance and new features for NAO, the authors took advantage of the ability to run ROS2 onboard on the NAO to develop a framework independent of the APIs provided by the manufacturer. Such a system provides NAO with not only the basic skills of a humanoid robot such as walking and reproducing movements of interest but also features often used in HRI such as: speech recognition/synthesis, face and object detention, and the use of Generative Pre-trained Transformer (GPT) models for conversation. The developed code is therefore configured as a ready-to-use but also highly expandable and improvable tool thanks to the possibilities provided by the ROS community.
- [1884] arXiv:2403.13969 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: "This is not a data problem": Algorithms and Power in Public Higher Education in CanadaComments: In CHI '24 Proceedings of the CHI Conference on Human Factors in Computing Systems Honolulu, HI, USASubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: Algorithmic decision-making is increasingly being adopted across public higher education. The expansion of data-driven practices by post-secondary institutions has occurred in parallel with the adoption of New Public Management approaches by neoliberal administrations. In this study, we conduct a qualitative analysis of an in-depth ethnographic case study of data and algorithms in use at a public college in Ontario, Canada. We identify the data, algorithms, and outcomes in use at the college. We assess how the college's processes and relationships support those outcomes and the different stakeholders' perceptions of the college's data-driven systems. In addition, we find that the growing reliance on algorithmic decisions leads to increased student surveillance, exacerbation of existing inequities, and the automation of the faculty-student relationship. Finally, we identify a cycle of increased institutional power perpetuated by algorithmic decision-making, and driven by a push towards financial sustainability.
- [1885] arXiv:2403.14006 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: On Prompt Sensitivity of ChatGPT in Affective ComputingComments: 2 Tables, 1 Figure, preprint submission to ACII 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Recent studies have demonstrated the emerging capabilities of foundation models like ChatGPT in several fields, including affective computing. However, accessing these emerging capabilities is facilitated through prompt engineering. Despite the existence of some prompting techniques, the field is still rapidly evolving and many prompting ideas still require investigation. In this work, we introduce a method to evaluate and investigate the sensitivity of the performance of foundation models based on different prompts or generation parameters. We perform our evaluation on ChatGPT within the scope of affective computing on three major problems, namely sentiment analysis, toxicity detection, and sarcasm detection. First, we carry out a sensitivity analysis on pivotal parameters in auto-regressive text generation, specifically the temperature parameter $T$ and the top-$p$ parameter in Nucleus sampling, dictating how conservative or creative the model should be during generation. Furthermore, we explore the efficacy of several prompting ideas, where we explore how giving different incentives or structures affect the performance. Our evaluation takes into consideration performance measures on the affective computing tasks, and the effectiveness of the model to follow the stated instructions, hence generating easy-to-parse responses to be smoothly used in downstream applications.
- [1886] arXiv:2403.14019 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Searching Search Spaces: Meta-evolving a Geometric Encoding for Neural NetworksComments: 9 pages, 8 figuresSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: In evolutionary policy search, neural networks are usually represented using a direct mapping: each gene encodes one network weight. Indirect encoding methods, where each gene can encode for multiple weights, shorten the genome to reduce the dimensions of the search space and better exploit permutations and symmetries. The Geometric Encoding for Neural network Evolution (GENE) introduced an indirect encoding where the weight of a connection is computed as the (pseudo-)distance between the two linked neurons, leading to a genome size growing linearly with the number of genes instead of quadratically in direct encoding. However GENE still relies on hand-crafted distance functions with no prior optimization. Here we show that better performing distance functions can be found for GENE using Cartesian Genetic Programming (CGP) in a meta-evolution approach, hence optimizing the encoding to create a search space that is easier to exploit. We show that GENE with a learned function can outperform both direct encoding and the hand-crafted distances, generalizing on unseen problems, and we study how the encoding impacts neural network properties.
- [1887] arXiv:2403.14037 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News DetectionSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Misinformation can seriously impact society, affecting anything from public opinion to institutional confidence and the political horizon of a state. Fake News (FN) proliferation on online websites and Online Social Networks (OSNs) has increased profusely. Various fact-checking websites include news in English and barely provide information about FN in regional languages. Thus the Urdu FN purveyors cannot be discerned using factchecking portals. SOTA approaches for Fake News Detection (FND) count upon appropriately labelled and large datasets. FND in regional and resource-constrained languages lags due to the lack of limited-sized datasets and legitimate lexical resources. The previous datasets for Urdu FND are limited-sized, domain-restricted, publicly unavailable and not manually verified where the news is translated from English into Urdu. In this paper, we curate and contribute the first largest publicly available dataset for Urdu FND, Ax-to-Grind Urdu, to bridge the identified gaps and limitations of existing Urdu datasets in the literature. It constitutes 10,083 fake and real news on fifteen domains collected from leading and authentic Urdu newspapers and news channel websites in Pakistan and India. FN for the Ax-to-Grind dataset is collected from websites and crowdsourcing. The dataset contains news items in Urdu from the year 2017 to the year 2023. Expert journalists annotated the dataset. We benchmark the dataset with an ensemble model of mBERT,XLNet, and XLM RoBERTa. The selected models are originally trained on multilingual large corpora. The results of the proposed model are based on performance metrics, F1-score, accuracy, precision, recall and MCC value.
- [1888] arXiv:2403.14049 (cross-list from cs.RO) [ pdf , ps , other ]
-
Title: A Roadmap Towards Automated and Regulated Robotic SystemsComments: 17 pages, 9 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: The rapid development of generative technology opens up possibility for higher level of automation, and artificial intelligence (AI) embodiment in robotic systems is imminent. However, due to the blackbox nature of the generative technology, the generation of the knowledge and workflow scheme is uncontrolled, especially in a dynamic environment and a complex scene. This poses challenges to regulations in safety-demanding applications such as medical scenes. We argue that the unregulated generative processes from AI is fitted for low level end tasks, but intervention in the form of manual or automated regulation should happen post-workflow-generation and pre-robotic-execution. To address this, we propose a roadmap that can lead to fully automated and regulated robotic systems. In this paradigm, the high level policies are generated as structured graph data, enabling regulatory oversight and reusability, while the code base for lower level tasks is generated by generative models. Our approach aims the transitioning from expert knowledge to regulated action, akin to the iterative processes of study, practice, scrutiny, and execution in human tasks. We identify the generative and deterministic processes in a design cycle, where generative processes serve as a text-based world simulator and the deterministic processes generate the executable system. We propose State Machine Seralization Language (SMSL) to be the conversion point between text simulator and executable workflow control. From there, we analyze the modules involved based on the current literature, and discuss human in the loop. As a roadmap, this work identifies the current possible implementation and future work. This work does not provide an implemented system but envisions to inspire the researchers working on the direction in the roadmap. We implement the SMSL and D-SFO paradigm that serve as the starting point of the roadmap.
- [1889] arXiv:2403.14092 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Carbon Footprint Reduction for Sustainable Data Centers in Real-TimeSoumyendu Sarkar , Avisek Naug , Ricardo Luna , Antonio Guillen , Vineet Gundecha , Sahand Ghorbanpour , Sajad Mousavi , Dejan Markovikj , Ashwin Ramesh BabuJournal-ref: 2024 Proceedings of the AAAI Conference on Artificial IntelligenceSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Abstract: As machine learning workloads significantly increase energy consumption, sustainable data centers with low carbon emissions are becoming a top priority for governments and corporations worldwide. This requires a paradigm shift in optimizing power consumption in cooling and IT loads, shifting flexible loads based on the availability of renewable energy in the power grid, and leveraging battery storage from the uninterrupted power supply in data centers, using collaborative agents. The complex association between these optimization strategies and their dependencies on variable external factors like weather and the power grid carbon intensity makes this a hard problem. Currently, a real-time controller to optimize all these goals simultaneously in a dynamic real-world setting is lacking. We propose a Data Center Carbon Footprint Reduction (DC-CFR) multi-agent Reinforcement Learning (MARL) framework that optimizes data centers for the multiple objectives of carbon footprint reduction, energy consumption, and energy cost. The results show that the DC-CFR MARL agents effectively resolved the complex interdependencies in optimizing cooling, load shifting, and energy storage in real-time for various locations under real-world dynamic weather and grid carbon intensity conditions. DC-CFR significantly outperformed the industry standard ASHRAE controller with a considerable reduction in carbon emissions (14.5%), energy usage (14.4%), and energy cost (13.7%) when evaluated over one year across multiple geographical regions.
- [1890] arXiv:2403.14110 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Heuristic Algorithm-based Action Masking Reinforcement Learning (HAAM-RL) with Ensemble Inference MethodComments: 7 pages, 8 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: This paper presents a novel reinforcement learning (RL) approach called HAAM-RL (Heuristic Algorithm-based Action Masking Reinforcement Learning) for optimizing the color batching re-sequencing problem in automobile painting processes. The existing heuristic algorithms have limitations in adequately reflecting real-world constraints and accurately predicting logistics performance. Our methodology incorporates several key techniques including a tailored Markov Decision Process (MDP) formulation, reward setting including Potential-Based Reward Shaping, action masking using heuristic algorithms (HAAM-RL), and an ensemble inference method that combines multiple RL models. The RL agent is trained and evaluated using FlexSim, a commercial 3D simulation software, integrated with our RL MLOps platform BakingSoDA. Experimental results across 30 scenarios demonstrate that HAAM-RL with an ensemble inference method achieves a 16.25% performance improvement over the conventional heuristic algorithm, with stable and consistent results. The proposed approach exhibits superior performance and generalization capability, indicating its effectiveness in optimizing complex manufacturing processes. The study also discusses future research directions, including alternative state representations, incorporating model-based RL methods, and integrating additional real-world constraints.
- [1891] arXiv:2403.14119 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature DispersionHee Suk Yoon , Eunseop Yoon , Joshua Tian Jin Tee , Mark Hasegawa-Johnson , Yingzhen Li , Chang D. YooComments: ICLR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at this https URL .
- [1892] arXiv:2403.14120 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Advancing IIoT with Over-the-Air Federated Learning: The Role of Iterative Magnitude PruningComments: 6 pages, 6 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: The industrial Internet of Things (IIoT) under Industry 4.0 heralds an era of interconnected smart devices where data-driven insights and machine learning (ML) fuse to revolutionize manufacturing. A noteworthy development in IIoT is the integration of federated learning (FL), which addresses data privacy and security among devices. FL enables edge sensors, also known as peripheral intelligence units (PIUs) to learn and adapt using their data locally, without explicit sharing of confidential data, to facilitate a collaborative yet confidential learning process. However, the lower memory footprint and computational power of PIUs inherently require deep neural network (DNN) models that have a very compact size. Model compression techniques such as pruning can be used to reduce the size of DNN models by removing unnecessary connections that have little impact on the model's performance, thus making the models more suitable for the limited resources of PIUs. Targeting the notion of compact yet robust DNN models, we propose the integration of iterative magnitude pruning (IMP) of the DNN model being trained in an over-the-air FL (OTA-FL) environment for IIoT. We provide a tutorial overview and also present a case study of the effectiveness of IMP in OTA-FL for an IIoT environment. Finally, we present future directions for enhancing and optimizing these deep compression techniques further, aiming to push the boundaries of IIoT capabilities in acquiring compact yet robust and high-performing DNN models.
- [1893] arXiv:2403.14146 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Evolving Benchmark Functions to Compare Evolutionary Algorithms via Genetic ProgrammingSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: In this study, we use Genetic Programming (GP) to compose new optimization benchmark functions. Optimization benchmarks have the important role of showing the differences between evolutionary algorithms, making it possible for further analysis and comparisons. We show that the benchmarks generated by GP are able to differentiate algorithms better than human-made benchmark functions. The fitness measure of the GP is the Wasserstein distance of the solutions found by a pair of optimizers. Additionally, we use MAP-Elites to both enhance the search power of the GP and also illustrate how the difference between optimizers changes by various landscape features. Our approach provides a novel way to automate the design of benchmark functions and to compare evolutionary algorithms.
- [1894] arXiv:2403.14151 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Deep Learning for Trajectory Data Management and Mining: A Survey and BeyondWei Chen , Yuxuan Liang , Yuanshao Zhu , Yanchuan Chang , Kang Luo , Haomin Wen , Lei Li , Yanwei Yu , Qingsong Wen , Chao Chen , Kai Zheng , Yunjun Gao , Xiaofang Zhou , Yu ZhengComments: 25 pages, 12 figures, 5 tablesSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
Abstract: Trajectory computing is a pivotal domain encompassing trajectory data management and mining, garnering widespread attention due to its crucial role in various practical applications such as location services, urban traffic, and public safety. Traditional methods, focusing on simplistic spatio-temporal features, face challenges of complex calculations, limited scalability, and inadequate adaptability to real-world complexities. In this paper, we present a comprehensive review of the development and recent advances in deep learning for trajectory computing (DL4Traj). We first define trajectory data and provide a brief overview of widely-used deep learning models. Systematically, we explore deep learning applications in trajectory management (pre-processing, storage, analysis, and visualization) and mining (trajectory-related forecasting, trajectory-related recommendation, trajectory classification, travel time estimation, anomaly detection, and mobility generation). Notably, we encapsulate recent advancements in Large Language Models (LLMs) that hold the potential to augment trajectory computing. Additionally, we summarize application scenarios, public datasets, and toolkits. Finally, we outline current challenges in DL4Traj research and propose future directions. Relevant papers and open-source resources have been collated and are continuously updated at: \href{ this https URL }{DL4Traj Repo}.
- [1895] arXiv:2403.14156 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Policy Mirror Descent with LookaheadSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract: Policy Mirror Descent (PMD) stands as a versatile algorithmic framework encompassing several seminal policy gradient algorithms such as natural policy gradient, with connections with state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice and recent remarkable empirical successes in RL such as AlphaGo and AlphaZero have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterpart. In this work, we propose a new class of PMD algorithms called $h$-PMD which incorporates multi-step greedy policy improvement with lookahead depth $h$ to the PMD update rule. To solve discounted infinite horizon Markov Decision Processes with discount factor $\gamma$, we show that $h$-PMD which generalizes the standard PMD enjoys a faster dimension-free $\gamma^h$-linear convergence rate, contingent on the computation of multi-step greedy policies. We propose an inexact version of $h$-PMD where lookahead action values are estimated. Under a generative model, we establish a sample complexity for $h$-PMD which improves over prior work. Finally, we extend our result to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity only involves dependence on the dimension of the feature map space instead of the state space size.
- [1896] arXiv:2403.14163 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal NavigationComments: will soon submit to the Elsevier journal, Advanced Engineering InformaticsSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Object-goal navigation is a crucial engineering task for the community of embodied navigation; it involves navigating to an instance of a specified object category within unseen environments. Although extensive investigations have been conducted on both end-to-end and modular-based, data-driven approaches, fully enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular-based approach, trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We utilize the multi-channel Swin-Unet architecture to conduct multi-task learning incorporating with multimodal inputs. The results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% in the efficiency metric, Success weighted by Path Length (SPL). The real-world demonstration shows that the proposed approach can efficiently conduct this task by traversing several rooms. For more details and real-world demonstrations, please check our project webpage ( this https URL ).
- [1897] arXiv:2403.14183 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic SegmentationComments: 22 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract: The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring muiltimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
- [1898] arXiv:2403.14186 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGANComments: Project website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Graphics (cs.GR)
Abstract: We propose a method that can generate cinemagraphs automatically from a still landscape image using a pre-trained StyleGAN. Inspired by the success of recent unconditional video generation, we leverage a powerful pre-trained image generator to synthesize high-quality cinemagraphs. Unlike previous approaches that mainly utilize the latent space of a pre-trained StyleGAN, our approach utilizes its deep feature space for both GAN inversion and cinemagraph generation. Specifically, we propose multi-scale deep feature warping (MSDFW), which warps the intermediate features of a pre-trained StyleGAN at different resolutions. By using MSDFW, the generated cinemagraphs are of high resolution and exhibit plausible looping animation. We demonstrate the superiority of our method through user studies and quantitative comparisons with state-of-the-art cinemagraph generation methods and a video generation method that uses a pre-trained StyleGAN.
- [1899] arXiv:2403.14188 (cross-list from cond-mat.dis-nn) [ pdf , ps , html , other ]
-
Title: Quantum-activated neural reservoirs on-chip open up large hardware security models for resilient authenticationSubjects: Disordered Systems and Neural Networks (cond-mat.dis-nn) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Abstract: Quantum artificial intelligence is a frontier of artificial intelligence research, pioneering quantum AI-powered circuits to address problems beyond the reach of deep learning with classical architectures. This work implements a large-scale quantum-activated recurrent neural network possessing more than 3 trillion hardware nodes/cm$^2$, originating from repeatable atomic-scale nucleation dynamics in an amorphous material integrated on-chip, controlled with 0.07 nW electric power per readout channel. Compared to the best-performing reservoirs currently reported, this implementation increases the scale of the network by two orders of magnitude and reduces the power consumption by six, reaching power efficiencies in the range of the human brain, dissipating 0.2 nW/neuron. When interrogated by a classical input, the chip implements a large-scale hardware security model, enabling dictionary-free authentication secure against statistical inference attacks, including AI's present and future development, even for an adversary with a copy of all the classical components available. Experimental tests report 99.6% reliability, 100% user authentication accuracy, and an ideal 50% key uniqueness. Due to its quantum nature, the chip supports a bit density per feature size area three times higher than the best technology available, with the capacity to store more than $2^{1104}$ keys in a footprint of 1 cm$^2$. Such a quantum-powered platform could help counteract the emerging form of warfare led by the cybercrime industry in breaching authentication to target small to large-scale facilities, from private users to intelligent energy grids.
- [1900] arXiv:2403.14200 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Debiasing surgeon: fantastic weights and how to find themSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Abstract: Nowadays an ever-growing concerning phenomenon, the emergence of algorithmic biases that can lead to unfair models, emerges. Several debiasing approaches have been proposed in the realm of deep learning, employing more or less sophisticated approaches to discourage these models from massively employing these biases. However, a question emerges: is this extra complexity really necessary? Is a vanilla-trained model already embodying some ``unbiased sub-networks'' that can be used in isolation and propose a solution without relying on the algorithmic biases? In this work, we show that such a sub-network typically exists, and can be extracted from a vanilla-trained model without requiring additional training. We further validate that such specific architecture is incapable of learning a specific bias, suggesting that there are possible architectural countermeasures to the problem of biases in deep neural networks.
- [1901] arXiv:2403.14203 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Unsupervised Audio-Visual Segmentation with Modality AlignmentSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound. Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability. To address this, we introduce unsupervised AVS, eliminating the need for such expensive annotation. To tackle this more challenging problem, we propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models like DINO, SAM, and ImageBind. This approach leverages their knowledge complementarity and optimizes their joint usage for multi-modality association. Initially, we estimate positive and negative image pairs in the feature space. For pixel-level association, we introduce an audio-visual adapter and a novel pixel matching aggregation strategy within the image-level contrastive learning framework. This allows for a flexible connection between object appearance and audio signal at the pixel level, with tolerance to imaging variations such as translation and rotation. Extensive experiments on the AVSBench (single and multi-object splits) and AVSS datasets demonstrate that our MoCA outperforms strongly designed baseline methods and approaches supervised counterparts, particularly in complex scenarios with multiple auditory objects. Notably when comparing mIoU, MoCA achieves a substantial improvement over baselines in both the AVSBench (S4: +17.24%; MS3: +67.64%) and AVSS (+19.23%) audio-visual segmentation challenges.
- [1902] arXiv:2403.14227 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: PeerGPT: Probing the Roles of LLM-based Peer Agents as Team Moderators and Participants in Children's Collaborative LearningComments: To appear at CHI EA '24Subjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: In children's collaborative learning, effective peer conversations can significantly enhance the quality of children's collaborative interactions. The integration of Large Language Model (LLM) agents into this setting explores their novel role as peers, assessing impacts as team moderators and participants. We invited two groups of participants to engage in a collaborative learning workshop, where they discussed and proposed conceptual solutions to a design problem. The peer conversation transcripts were analyzed using thematic analysis. We discovered that peer agents, while managing discussions effectively as team moderators, sometimes have their instructions disregarded. As participants, they foster children's creative thinking but may not consistently provide timely feedback. These findings highlight potential design improvements and considerations for peer agents in both roles.
- [1903] arXiv:2403.14233 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: SoftPatch: Unsupervised Anomaly Detection with Noisy DataXi Jiang , Ying Chen , Qiang Nie , Yong Liu , Jianlin Liu , Bin-Bin Gao , Jun Liu , Chengjie Wang , Feng ZhengComments: 36th Conference on Neural Information Processing SystemsJournal-ref: Advances in Neural Information Processing Systems 35, ISBN: 9781713871088, (2022)Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Although mainstream unsupervised anomaly detection (AD) algorithms perform well in academic datasets, their performance is limited in practical application due to the ideal experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This paper considers label-level noise in image sensory anomaly detection for the first time. To solve this problem, we proposed a memory-based unsupervised AD method, SoftPatch, which efficiently denoises the data at the patch level. Noise discriminators are utilized to generate outlier scores for patch-level noise elimination before coreset construction. The scores are then stored in the memory bank to soften the anomaly detection boundary. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset. Comprehensive experiments in various noise scenes demonstrate that SoftPatch outperforms the state-of-the-art AD methods on the MVTecAD and BTAD benchmarks and is comparable to those methods under the setting without noise.
- [1904] arXiv:2403.14236 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: A Unified Framework for Model EditingComments: EMMET can do batched edits of batch size 10k with performance very similar to MEMITSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We introduce a unifying framework that brings two leading "locate-and-edit" model editing techniques -- ROME and MEMIT -- under a single conceptual umbrella, optimizing for the same goal, which we call the preservation-memorization objective. ROME uses an equality constraint to perform one edit at a time, whereas MEMIT employs a more flexible least-square constraint that allows for batched edits. Following the preservation-memorization objective, we present Equality-constrained Mass Model Editing algorithm for Transformers or EMMET, a new batched memory-editing algorithm that uses a closed-form solution for the equality-constrained version of the preservation-memorization objective. EMMET is a batched-version of ROME and is able to perform batched-edits up to a batch-size of 10,000 with very similar performance to MEMIT across multiple dimensions. With EMMET, we unify and achieve symmetry within the "locate-and-edit" algorithms, allowing batched-editing using both objectives.
- [1905] arXiv:2403.14238 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-ReflectionComments: 22 pages, 5 figures, Submitted to ACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Despite the promise of RLHF in aligning LLMs with human preferences, it often leads to superficial alignment, prioritizing stylistic changes over improving downstream performance of LLMs. Underspecified preferences could obscure directions to align the models. Lacking exploration restricts identification of desirable outputs to improve the models. To overcome these challenges, we propose a novel framework: Reinforcement Learning from Reflective Feedback (RLRF), which leverages fine-grained feedback based on detailed criteria to improve the core capabilities of LLMs. RLRF employs a self-reflection mechanism to systematically explore and refine LLM responses, then fine-tuning the models via a RL algorithm along with promising responses. Our experiments across Just-Eval, Factuality, and Mathematical Reasoning demonstrate the efficacy and transformative potential of RLRF beyond superficial surface-level adjustment.
- [1906] arXiv:2403.14243 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Dermacen Analytica: A Novel Methodology Integrating Multi-Modal Large Language Models with Machine Learning in tele-dermatologySubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: The rise of Artificial Intelligence creates great promise in the field of medical discovery, diagnostics and patient management. However, the vast complexity of all medical domains require a more complex approach that combines machine learning algorithms, classifiers, segmentation algorithms and, lately, large language models. In this paper, we describe, implement and assess an Artificial Intelligence-empowered system and methodology aimed at assisting the diagnosis process of skin lesions and other skin conditions within the field of dermatology that aims to holistically address the diagnostic process in this domain. The workflow integrates large language, transformer-based vision models and sophisticated machine learning tools. This holistic approach achieves a nuanced interpretation of dermatological conditions that simulates and facilitates a dermatologist's workflow. We assess our proposed methodology through a thorough cross-model validation technique embedded in an evaluation pipeline that utilizes publicly available medical case studies of skin conditions and relevant images. To quantitatively score the system performance, advanced machine learning and natural language processing tools are employed which focus on similarity comparison and natural language inference. Additionally, we incorporate a human expert evaluation process based on a structured checklist to further validate our results. We implemented the proposed methodology in a system which achieved approximate (weighted) scores of 0.87 for both contextual understanding and diagnostic accuracy, demonstrating the efficacy of our approach in enhancing dermatological analysis. The proposed methodology is expected to prove useful in the development of next-generation tele-dermatology applications, enhancing remote consultation capabilities and access to care, especially in underserved areas.
- [1907] arXiv:2403.14244 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Isotropic Gaussian Splatting for Real-Time Radiance Field RenderingSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract: The 3D Gaussian splatting method has drawn a lot of attention, thanks to its high performance in training and high quality of the rendered image. However, it uses anisotropic Gaussian kernels to represent the scene. Although such anisotropic kernels have advantages in representing the geometry, they lead to difficulties in terms of computation, such as splitting or merging two kernels. In this paper, we propose to use isotropic Gaussian kernels to avoid such difficulties in the computation, leading to a higher performance method. The experiments confirm that the proposed method is about {\bf 100X} faster without losing the geometry representation accuracy. The proposed method can be applied in a large range applications where the radiance field is needed, such as 3D reconstruction, view synthesis, and dynamic object modeling.
- [1908] arXiv:2403.14246 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: CATSE: A Context-Aware Framework for Causal Target Sound ExtractionComments: Submitted to EUSIPCO 2024Subjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI)
Abstract: Target Sound Extraction (TSE) focuses on the problem of separating sources of interest, indicated by a user's cue, from the input mixture. Most existing solutions operate in an offline fashion and are not suited to the low-latency causal processing constraints imposed by applications in live-streamed content such as augmented hearing. We introduce a family of context-aware low-latency causal TSE models suitable for real-time processing. First, we explore the utility of context by providing the TSE model with oracle information about what sound classes make up the input mixture, where the objective of the model is to extract one or more sources of interest indicated by the user. Since the practical applications of oracle models are limited due to their assumptions, we introduce a composite multi-task training objective involving separation and classification losses. Our evaluation involving single- and multi-source extraction shows the benefit of using context information in the model either by means of providing full context or via the proposed multi-task training loss without the need for full context information. Specifically, we show that our proposed model outperforms size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE.
- [1909] arXiv:2403.14252 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document UnderstandingComments: LREC-COLING 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their importance. Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose a new LayoutLLM that integrates these with large-scale language models (LLMs). By leveraging the strengths of existing research in document image understanding and LLMs' superior language understanding capabilities, the proposed model, fine-tuned with multimodal instruction datasets, performs an understanding of document images in a single model. Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
- [1910] arXiv:2403.14264 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: A Framework for Portrait Stylization with Skin-Tone Awareness and Nudity IdentificationComments: Accepted to ICASSP 2024Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Portrait stylization is a challenging task involving the transformation of an input portrait image into a specific style while preserving its inherent characteristics. The recent introduction of Stable Diffusion (SD) has significantly improved the quality of outcomes in this field. However, a practical stylization framework that can effectively filter harmful input content and preserve the distinct characteristics of an input, such as skin-tone, while maintaining the quality of stylization remains lacking. These challenges have hindered the wide deployment of such a framework. To address these issues, this study proposes a portrait stylization framework that incorporates a nudity content identification module (NCIM) and a skin-tone-aware portrait stylization module (STAPSM). In experiments, NCIM showed good performance in enhancing explicit content filtering, and STAPSM accurately represented a diverse range of skin tones. Our proposed framework has been successfully deployed in practice, and it has effectively satisfied critical requirements of real-world applications.
- [1911] arXiv:2403.14273 (cross-list from cs.NE) [ pdf , ps , html , other ]
-
Title: Reactor Optimization Benchmark by Reinforcement LearningSubjects: Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI)
Abstract: Neutronic calculations for reactors are a daunting task when using Monte Carlo (MC) methods. As high-performance computing has advanced, the simulation of a reactor is nowadays more readily done, but design and optimization with multiple parameters is still a computational challenge. MC transport simulations, coupled with machine learning techniques, offer promising avenues for enhancing the efficiency and effectiveness of nuclear reactor optimization. This paper introduces a novel benchmark problem within the OpenNeoMC framework designed specifically for reinforcement learning. The benchmark involves optimizing a unit cell of a research reactor with two varying parameters (fuel density and water spacing) to maximize neutron flux while maintaining reactor criticality. The test case features distinct local optima, representing different physical regimes, thus posing a challenge for learning algorithms. Through extensive simulations utilizing evolutionary and neuroevolutionary algorithms, we demonstrate the effectiveness of reinforcement learning in navigating complex optimization landscapes with strict constraints. Furthermore, we propose acceleration techniques within the OpenNeoMC framework, including model updating and cross-section usage by RAM utilization, to expedite simulation times. Our findings emphasize the importance of machine learning integration in reactor optimization and contribute to advancing methodologies for addressing intricate optimization challenges in nuclear engineering. The sources of this work are available at our GitHub repository: this https URL
- [1912] arXiv:2403.14274 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Multi-role Consensus through LLMs Discussions for Vulnerability DetectionSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI)
Abstract: Recent advancements in large language models (LLMs) have highlighted the potential for vulnerability detection, a crucial component of software quality assurance. Despite this progress, most studies have been limited to the perspective of a single role, usually testers, lacking diverse viewpoints from different roles in a typical software development life-cycle, including both developers and testers. To this end, this paper introduces a multi-role approach to employ LLMs to act as different roles simulating a real-life code review process and engaging in discussions toward a consensus on the existence and classification of vulnerabilities in the code. Preliminary evaluation of this approach indicates a 13.48% increase in the precision rate, an 18.25% increase in the recall rate, and a 16.13% increase in the F1 score.
- [1913] arXiv:2403.14282 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: How to be fair? A study of label and selection biasJournal-ref: Machine Learning 112.12 (2023): 5081-5104Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract: It is widely accepted that biased data leads to biased and thus potentially unfair models. Therefore, several measures for bias in data and model predictions have been proposed, as well as bias mitigation techniques whose aim is to learn models that are fair by design. Despite the myriad of mitigation techniques developed in the past decade, however, it is still poorly understood under what circumstances which methods work. Recently, Wick et al. showed, with experiments on synthetic data, that there exist situations in which bias mitigation techniques lead to more accurate models when measured on unbiased data. Nevertheless, in the absence of a thorough mathematical analysis, it remains unclear which techniques are effective under what circumstances. We propose to address this problem by establishing relationships between the type of bias and the effectiveness of a mitigation technique, where we categorize the mitigation techniques by the bias measure they optimize. In this paper we illustrate this principle for label and selection bias on the one hand, and demographic parity and ``We're All Equal'' on the other hand. Our theoretical analysis allows to explain the results of Wick et al. and we also show that there are situations where minimizing fairness measures does not result in the fairest possible distribution.
- [1914] arXiv:2403.14287 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Enhancing Historical Image Retrieval with Compositional CuesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract: In analyzing vast amounts of digitally stored historical image data, existing content-based retrieval methods often overlook significant non-semantic information, limiting their effectiveness for flexible exploration across varied themes. To broaden the applicability of image retrieval methods for diverse purposes and uncover more general patterns, we innovatively introduce a crucial factor from computational aesthetics, namely image composition, into this topic. By explicitly integrating composition-related information extracted by CNN into the designed retrieval model, our method considers both the image's composition rules and semantic information. Qualitative and quantitative experiments demonstrate that the image retrieval network guided by composition information outperforms those relying solely on content information, facilitating the identification of images in databases closer to the target image in human perception. Please visit this https URL to try our codes.
- [1915] arXiv:2403.14297 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Impact Assessment of Missing Data in Model Predictions for Earth Observation ApplicationsComments: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Earth observation (EO) applications involving complex and heterogeneous data sources are commonly approached with machine learning models. However, there is a common assumption that data sources will be persistently available. Different situations could affect the availability of EO sources, like noise, clouds, or satellite mission failures. In this work, we assess the impact of missing temporal and static EO sources in trained models across four datasets with classification and regression tasks. We compare the predictive quality of different methods and find that some are naturally more robust to missing data. The Ensemble strategy, in particular, achieves a prediction robustness up to 100%. We evidence that missing scenarios are significantly more challenging in regression than classification tasks. Finally, we find that the optical view is the most critical view when it is missing individually.
- [1916] arXiv:2403.14298 (cross-list from cs.SI) [ pdf , ps , html , other ]
-
Title: From Perils to Possibilities: Understanding how Human (and AI) Biases affect Online ForaVirginia Morini , Valentina Pansanella , Katherine Abramski , Erica Cau , Andrea Failla , Salvatore Citraro , Giulio RossettiSubjects: Social and Information Networks (cs.SI) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Social media platforms are online fora where users engage in discussions, share content, and build connections. This review explores the dynamics of social interactions, user-generated contents, and biases within the context of social media analysis (analyzing works that use the tools offered by complex network analysis and natural language processing) through the lens of three key points of view: online debates, online support, and human-AI interactions. On the one hand, we delineate the phenomenon of online debates, where polarization, misinformation, and echo chamber formation often proliferate, driven by algorithmic biases and extreme mechanisms of homophily. On the other hand, we explore the emergence of online support groups through users' self-disclosure and social support mechanisms. Online debates and support mechanisms present a duality of both perils and possibilities within social media; perils of segregated communities and polarized debates, and possibilities of empathy narratives and self-help groups. This dichotomy also extends to a third perspective: users' reliance on AI-generated content, such as the ones produced by Large Language Models, which can manifest both human biases hidden in training sets and non-human biases that emerge from their artificial neural architectures. Analyzing interdisciplinary approaches, we aim to deepen the understanding of the complex interplay between social interactions, user-generated content, and biases within the realm of social media ecosystems.
- [1917] arXiv:2403.14300 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: DexDribbler: Learning Dexterous Soccer Manipulation via Dynamic SupervisionComments: 8 pages, 7 figures, submitted to IROS 2024Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI)
Abstract: Learning dexterous locomotion policy for legged robots is becoming increasingly popular due to its ability to handle diverse terrains and resemble intelligent behaviors. However, joint manipulation of moving objects and locomotion with legs, such as playing soccer, receive scant attention in the learning community, although it is natural for humans and smart animals. A key challenge to solve this multitask problem is to infer the objectives of locomotion from the states and targets of the manipulated objects. The implicit relation between the object states and robot locomotion can be hard to capture directly from the training experience. We propose adding a feedback control block to compute the necessary body-level movement accurately and using the outputs as dynamic joint-level locomotion supervision explicitly. We further utilize an improved ball dynamic model, an extended context-aided estimator, and a comprehensive ball observer to facilitate transferring policy learned in simulation to the real world. We observe that our learning scheme can not only make the policy network converge faster but also enable soccer robots to perform sophisticated maneuvers like sharp cuts and turns on flat surfaces, a capability that was lacking in previous methods. Video and code are available at this https URL
- [1918] arXiv:2403.14328 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Distilling Reinforcement Learning Policies for Interpretable Robot Locomotion: Gradient Boosting Machines and Symbolic RegressionSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recent advancements in reinforcement learning (RL) have led to remarkable achievements in robot locomotion capabilities. However, the complexity and ``black-box'' nature of neural network-based RL policies hinder their interpretability and broader acceptance, particularly in applications demanding high levels of safety and reliability. This paper introduces a novel approach to distill neural RL policies into more interpretable forms using Gradient Boosting Machines (GBMs), Explainable Boosting Machines (EBMs) and Symbolic Regression. By leveraging the inherent interpretability of generalized additive models, decision trees, and analytical expressions, we transform opaque neural network policies into more transparent ``glass-box'' models. We train expert neural network policies using RL and subsequently distill them into (i) GBMs, (ii) EBMs, and (iii) symbolic policies. To address the inherent distribution shift challenge of behavioral cloning, we propose to use the Dataset Aggregation (DAgger) algorithm with a curriculum of episode-dependent alternation of actions between expert and distilled policies, to enable efficient distillation of feedback control policies. We evaluate our approach on various robot locomotion gaits -- walking, trotting, bounding, and pacing -- and study the importance of different observations in joint actions for distilled policies using various methods. We train neural expert policies for 205 hours of simulated experience and distill interpretable policies with only 10 minutes of simulated interaction for each gait using the proposed method.
- [1919] arXiv:2403.14339 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: $\nabla \tau$: Gradient-based and Task-Agnostic machine UnlearningComments: 14 pages, 2 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Machine Unlearning, the process of selectively eliminating the influence of certain data examples used during a model's training, has gained significant attention as a means for practitioners to comply with recent data protection regulations. However, existing unlearning methods face critical drawbacks, including their prohibitively high cost, often associated with a large number of hyperparameters, and the limitation of forgetting only relatively small data portions. This often makes retraining the model from scratch a quicker and more effective solution. In this study, we introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla \tau$), an optimization framework designed to remove the influence of a subset of training data efficiently. It applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data. $\nabla \tau$ offers multiple benefits over existing approaches. It enables the unlearning of large sections of the training dataset (up to 30%). It is versatile, supporting various unlearning tasks (such as subset forgetting or class removal) and applicable across different domains (images, text, etc.). Importantly, $\nabla \tau$ requires no hyperparameter adjustments, making it a more appealing option than retraining the model from scratch. We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics, demonstrating up to 10% enhancements in performance compared to state-of-the-art methods without compromising the original model's accuracy.
- [1920] arXiv:2403.14340 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Exploring Task Unification in Graph Representation Learning via Generative ApproachSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Graphs are ubiquitous in real-world scenarios and encompass a diverse range of tasks, from node-, edge-, and graph-level tasks to transfer learning. However, designing specific tasks for each type of graph data is often costly and lacks generalizability. Recent endeavors under the "Pre-training + Fine-tuning" or "Pre-training + Prompt" paradigms aim to design a unified framework capable of generalizing across multiple graph tasks. Among these, graph autoencoders (GAEs), generative self-supervised models, have demonstrated their potential in effectively addressing various graph tasks. Nevertheless, these methods typically employ multi-stage training and require adaptive designs, which on one hand make it difficult to be seamlessly applied to diverse graph tasks and on the other hand overlook the negative impact caused by discrepancies in task objectives between the different stages. To address these challenges, we propose GA^2E, a unified adversarially masked autoencoder capable of addressing the above challenges seamlessly. Specifically, GA^2E proposes to use the subgraph as the meta-structure, which remains consistent across all graph tasks (ranging from node-, edge-, and graph-level to transfer learning) and all stages (both during training and inference). Further, GA^2E operates in a \textbf{"Generate then Discriminate"} manner. It leverages the masked GAE to reconstruct the input subgraph whilst treating it as a generator to compel the reconstructed graphs resemble the input subgraph. Furthermore, GA^2E introduces an auxiliary discriminator to discern the authenticity between the reconstructed (generated) subgraph and the input subgraph, thus ensuring the robustness of the graph representation through adversarial training mechanisms. We validate GA^2E's capabilities through extensive experiments on 21 datasets across four types of graph tasks.
- [1921] arXiv:2403.14358 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Exploring the Potential of Large Language Models in Graph GenerationYang Yao , Xin Wang , Zeyang Zhang , Yijian Qin , Ziwei Zhang , Xu Chu , Yuekui Yang , Wenwu Zhu , Hong MeiSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Abstract: Large language models (LLMs) have achieved great success in many fields, and recent works have studied exploring LLMs for graph discriminative tasks such as node classification. However, the abilities of LLMs for graph generation remain unexplored in the literature. Graph generation requires the LLM to generate graphs with given properties, which has valuable real-world applications such as drug discovery, while tends to be more challenging. In this paper, we propose LLM4GraphGen to explore the ability of LLMs for graph generation with systematical task designs and extensive experiments. Specifically, we propose several tasks tailored with comprehensive experiments to address key questions regarding LLMs' understanding of different graph structure rules, their ability to capture structural type distributions, and their utilization of domain knowledge for property-based graph generation. Our evaluations demonstrate that LLMs, particularly GPT-4, exhibit preliminary abilities in graph generation tasks, including rule-based and distribution-based generation. We also observe that popular prompting methods, such as few-shot and chain-of-thought prompting, do not consistently enhance performance. Besides, LLMs show potential in generating molecules with specific properties. These findings may serve as foundations for designing good LLMs based models for graph generation and provide valuable insights and further research.
- [1922] arXiv:2403.14371 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Loop Improvement: An Efficient Approach for Extracting Shared Features from Heterogeneous Data without Central ServerComments: 11 pages, 11 figuresSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract: In federated learning, data heterogeneity significantly impacts performance. A typical solution involves segregating these parameters into shared and personalized components, a concept also relevant in multi-task learning. Addressing this, we propose "Loop Improvement" (LI), a novel method enhancing this separation and feature extraction without necessitating a central server or data interchange among participants. Our experiments reveal LI's superiority in several aspects: In personalized federated learning environments, LI consistently outperforms the advanced FedALA algorithm in accuracy across diverse scenarios. Additionally, LI's feature extractor closely matches the performance achieved when aggregating data from all clients. In global model contexts, employing LI with stacked personalized layers and an additional network also yields comparable results to combined client data scenarios. Furthermore, LI's adaptability extends to multi-task learning, streamlining the extraction of common features across tasks and obviating the need for simultaneous training. This approach not only enhances individual task performance but also achieves accuracy levels on par with classic multi-task learning methods where all tasks are trained simultaneously. LI integrates a loop topology with layer-wise and end-to-end training, compatible with various neural network models. This paper also delves into the theoretical underpinnings of LI's effectiveness, offering insights into its potential applications. The code is on this https URL
- [1923] arXiv:2403.14377 (cross-list from cs.IR) [ pdf , ps , html , other ]
-
Title: Knowledge-Enhanced Recommendation with User-Centric Subgraph NetworkSubjects: Information Retrieval (cs.IR) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Recommendation systems, as widely implemented nowadays on various platforms, recommend relevant items to users based on their preferences. The classical methods which rely on user-item interaction matrices has limitations, especially in scenarios where there is a lack of interaction data for new items. Knowledge graph (KG)-based recommendation systems have emerged as a promising solution. However, most KG-based methods adopt node embeddings, which do not provide personalized recommendations for different users and cannot generalize well to the new items. To address these limitations, we propose Knowledge-enhanced User-Centric subgraph Network (KUCNet), a subgraph learning approach with graph neural network (GNN) for effective recommendation. KUCNet constructs a U-I subgraph for each user-item pair that captures both the historical information of user-item interactions and the side information provided in KG. An attention-based GNN is designed to encode the U-I subgraphs for recommendation. Considering efficiency, the pruned user-centric computation graph is further introduced such that multiple U-I subgraphs can be simultaneously computed and that the size can be pruned by Personalized PageRank. Our proposed method achieves accurate, efficient, and interpretable recommendations especially for new items. Experimental results demonstrate the superiority of KUCNet over state-of-the-art KG-based and collaborative filtering (CF)-based methods.
- [1924] arXiv:2403.14381 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Editing Knowledge Representation of Language Lodel via Rephrased Prefix PromptsComments: 19pages,3figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Neural language models (LMs) have been extensively trained on vast corpora to store factual knowledge about various aspects of the world described in texts. Current technologies typically employ knowledge editing methods or specific prompts to modify LM outputs. However, existing knowledge editing methods are costly and inefficient, struggling to produce appropriate text. Additionally, prompt engineering is opaque and requires significant effort to find suitable prompts. To address these issues, we introduce a new method called PSPEM (Prefix Soft Prompt Editing Method), that can be used for a lifetime with just one training. It resolves the inefficiencies and generalizability issues in knowledge editing methods and overcomes the opacity of prompt engineering by automatically seeking optimal soft prompts. Specifically, PSPEM utilizes a prompt encoder and an encoding converter to refine key information in prompts and uses prompt alignment techniques to guide model generation, ensuring text consistency and adherence to the intended structure and content, thereby maintaining an optimal balance between efficiency and accuracy. We have validated the effectiveness of PSPEM through knowledge editing and attribute inserting. On the COUNTERFACT dataset, PSPEM achieved nearly 100\% editing accuracy and demonstrated the highest level of fluency. We further analyzed the similarities between PSPEM and original prompts and their impact on the model's internals. The results indicate that PSPEM can serve as an alternative to original prompts, supporting the model in effective editing.
- [1925] arXiv:2403.14399 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Building Accurate Translation-Tailored LLMs with Language Aware Instruction TuningSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Translation-tailored Large language models (LLMs) exhibit remarkable translation capabilities, even competing with supervised-trained commercial translation systems. However, off-target translation remains an unsolved problem, especially for low-resource languages, hindering us from developing accurate LLMs-based translation models. To mitigate the off-target translation problem and enhance the performance of LLMs on translation, recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs by feeding few-shot demonstrations. However, these methods essentially do not improve LLM's ability to follow translation instructions, especially the language direction information. In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs. Specifically, we first tune LLMs with the maximum likelihood estimation loss on the translation dataset to elicit the basic translation capabilities. In the second stage, we construct instruction-conflicting samples by randomly replacing the translation directions with a wrong one within the instruction, and then introduce an extra unlikelihood loss to learn those samples. Experiments on IWSLT and WMT benchmarks upon the LLaMA model spanning 16 zero-shot directions show that, compared to the competitive baseline -- translation-finetuned LLama, our method could effectively reduce the off-target translation ratio (averagely -53.3\%), thus improving translation quality with average +5.7 SacreBLEU and +16.4 BLEURT. Analysis shows that our method could preserve the model's general task performance on AlpacaEval. Code and models will be released at \url{ this https URL }.
- [1926] arXiv:2403.14403 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question ComplexityComments: NAACL 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Retrieval-Augmented Large Language Models (LLMs), which incorporate the non-parametric knowledge from external knowledge bases into LLMs, have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, even though there are various approaches dealing with queries of different complexities, they either handle simple queries with unnecessary computational overhead or fail to adequately address complex multi-step queries; yet, not all user requests fall into only one of the simple or complex categories. In this work, we propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs from the simplest to the most sophisticated ones based on the query complexity. Also, this selection process is operationalized with a classifier, which is a smaller LM trained to predict the complexity level of incoming queries with automatically collected labels, obtained from actual predicted outcomes of models and inherent inductive biases in datasets. This approach offers a balanced strategy, seamlessly adapting between the iterative and single-step retrieval-augmented LLMs, as well as the no-retrieval methods, in response to a range of query complexities. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems, compared to relevant baselines including the adaptive retrieval approaches. Code is available at: this https URL .
- [1927] arXiv:2403.14409 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Locating and Mitigating Gender Bias in Large Language ModelsComments: 23 pages, 5 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Large language models(LLM) are pre-trained on extensive corpora to learn facts and human cognition which contain human preferences. However, this process can inadvertently lead to these models acquiring biases and stereotypes prevalent in society. Prior research has typically tackled the issue of bias through a one-dimensional perspective, concentrating either on locating or mitigating it. This limited perspective has created obstacles in facilitating research on bias to synergistically complement and progressively build upon one another. In this study, we integrate the processes of locating and mitigating bias within a unified framework. Initially, we use causal mediation analysis to trace the causal effects of different components' activation within a large language model. Building on this, we propose the LSDM (Least Square Debias Method), a knowledge-editing based method for mitigating gender bias in occupational pronouns, and compare it against two baselines on three gender bias datasets and seven knowledge competency test datasets. The experimental results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns and the top attention module acting on the final word in the sentence. Furthermore, LSDM mitigates gender bias in the model more effectively than the other baselines, while fully preserving the model's capabilities in all other aspects.
- [1928] arXiv:2403.14410 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: GLC++: Source-Free Universal Domain Adaptation through Global-Local Clustering and Contrastive Affinity LearningComments: This is a substantial extension of the CVPR 2023 paper "Upcycling Models under Domain and Category Shift"Subjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep neural networks often exhibit sub-optimal performance under covariate and category shifts. Source-Free Domain Adaptation (SFDA) presents a promising solution to this dilemma, yet most SFDA approaches are restricted to closed-set scenarios. In this paper, we explore Source-Free Universal Domain Adaptation (SF-UniDA) aiming to accurately classify "known" data belonging to common categories and segregate them from target-private "unknown" data. We propose a novel Global and Local Clustering (GLC) technique, which comprises an adaptive one-vs-all global clustering algorithm to discern between target classes, complemented by a local k-NN clustering strategy to mitigate negative transfer. Despite the effectiveness, the inherent closed-set source architecture leads to uniform treatment of "unknown" data, impeding the identification of distinct "unknown" categories. To address this, we evolve GLC to GLC++, integrating a contrastive affinity learning strategy. We examine the superiority of GLC and GLC++ across multiple benchmarks and category shift scenarios. Remarkably, in the most challenging open-partial-set scenarios, GLC and GLC++ surpass GATE by 16.7% and 18.6% in H-score on VisDA, respectively. GLC++ enhances the novel category clustering accuracy of GLC by 4.3% in open-set scenarios on Office-Home. Furthermore, the introduced contrastive learning strategy not only enhances GLC but also significantly facilitates existing methodologies.
- [1929] arXiv:2403.14429 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: Style-Extracting Diffusion Models for Semi-Supervised Histopathology SegmentationMathias Öttl , Frauke Wilm , Jana Steenpass , Jingna Qiu , Matthias Rübner , Arndt Hartmann , Matthias Beckmann , Peter Fasching , Andreas Maier , Ramona Erber , Bernhard Kainz , Katharina BreiningerSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Deep learning-based image generation has seen significant advancements with diffusion models, notably improving the quality of generated images. Despite these developments, generating images with unseen characteristics beneficial for downstream tasks has received limited attention. To bridge this gap, we propose Style-Extracting Diffusion Models, featuring two conditioning mechanisms. Specifically, we utilize 1) a style conditioning mechanism which allows to inject style information of previously unseen images during image generation and 2) a content conditioning which can be targeted to a downstream task, e.g., layout for segmentation. We introduce a trainable style encoder to extract style information from images, and an aggregation block that merges style information from multiple style inputs. This architecture enables the generation of images with unseen styles in a zero-shot manner, by leveraging styles from unseen images, resulting in more diverse generations. In this work, we use the image layout as target condition and first show the capability of our method on a natural image dataset as a proof-of-concept. We further demonstrate its versatility in histopathology, where we combine prior knowledge about tissue composition and unannotated data to create diverse synthetic images with known layouts. This allows us to generate additional synthetic data to train a segmentation network in a semi-supervised fashion. We verify the added value of the generated images by showing improved segmentation results and lower performance variability between patients when synthetic images are included during segmentation training. Our code will be made publicly available at [LINK].
- [1930] arXiv:2403.14432 (cross-list from eess.SY) [ pdf , ps , html , other ]
-
Title: On the continuity and smoothness of the value function in reinforcement learning and optimal controlSubjects: Systems and Control (eess.SY) ; Artificial Intelligence (cs.AI)
Abstract: The value function plays a crucial role as a measure for the cumulative future reward an agent receives in both reinforcement learning and optimal control. It is therefore of interest to study how similar the values of neighboring states are, i.e., to investigate the continuity of the value function. We do so by providing and verifying upper bounds on the value function's modulus of continuity. Additionally, we show that the value function is always Hölder continuous under relatively weak assumptions on the underlying system and that non-differentiable value functions can be made differentiable by slightly "disturbing" the system.
- [1931] arXiv:2403.14435 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Biased Binary Attribute Classifiers Ignore the Majority ClassesSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: To visualize the regions of interest that classifiers base their decisions on, different Class Activation Mapping (CAM) methods have been developed. However, all of these techniques target categorical classifiers only, though most real-world tasks are binary classification. In this paper, we extend gradient-based CAM techniques to work with binary classifiers and visualize the active regions for binary facial attribute classifiers. When training an unbalanced binary classifier on an imbalanced dataset, it is well-known that the majority class, i.e. the class with many training samples, is mostly predicted much better than minority class with few training instances. In our experiments on the CelebA dataset, we verify these results, when training an unbalanced classifier to extract 40 facial attributes simultaneously. One would expect that the biased classifier has learned to extract features mainly for the majority classes and that the proportional energy of the activations mainly reside in certain specific regions of the image where the attribute is located. However, we find very little regular activation for samples of majority classes, while the active regions for minority classes seem mostly reasonable and overlap with our expectations. These results suggest that biased classifiers mainly rely on bias activation for majority classes. When training a balanced classifier on the imbalanced data by employing attribute-specific class weights, majority and minority classes are classified similarly well and show expected activations for almost all attributes
- [1932] arXiv:2403.14440 (cross-list from eess.IV) [ pdf , ps , html , other ]
-
Title: Analysing Diffusion Segmentation for Medical ImagesMathias Öttl , Siyuan Mei , Frauke Wilm , Jana Steenpass , Matthias Rübner , Arndt Hartmann , Matthias Beckmann , Peter Fasching , Andreas Maier , Ramona Erber , Katharina BreiningerSubjects: Image and Video Processing (eess.IV) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract: Denoising Diffusion Probabilistic models have become increasingly popular due to their ability to offer probabilistic modeling and generate diverse outputs. This versatility inspired their adaptation for image segmentation, where multiple predictions of the model can produce segmentation results that not only achieve high quality but also capture the uncertainty inherent in the model. Here, powerful architectures were proposed for improving diffusion segmentation performance. However, there is a notable lack of analysis and discussions on the differences between diffusion segmentation and image generation, and thorough evaluations are missing that distinguish the improvements these architectures provide for segmentation in general from their benefit for diffusion segmentation specifically. In this work, we critically analyse and discuss how diffusion segmentation for medical images differs from diffusion image generation, with a particular focus on the training behavior. Furthermore, we conduct an assessment how proposed diffusion segmentation architectures perform when trained directly for segmentation. Lastly, we explore how different medical segmentation tasks influence the diffusion segmentation behavior and the diffusion process could be adapted accordingly. With these analyses, we aim to provide in-depth insights into the behavior of diffusion segmentation that allow for a better design and evaluation of diffusion segmentation methods in the future.
- [1933] arXiv:2403.14459 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Multi-Level Explanations for Generative Language ModelsLucas Monteiro Paes , Dennis Wei , Hyo Jin Do , Hendrik Strobelt , Ronny Luss , Amit Dhurandhar , Manish Nagireddy , Karthikeyan Natesan Ramamurthy , Prasanna Sattigeri , Werner Geyer , Soumya GhoshSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: Perturbation-based explanation methods such as LIME and SHAP are commonly applied to text classification. This work focuses on their extension to generative language models. To address the challenges of text as output and long text inputs, we propose a general framework called MExGen that can be instantiated with different attribution algorithms. To handle text output, we introduce the notion of scalarizers for mapping text to real numbers and investigate multiple possibilities. To handle long inputs, we take a multi-level approach, proceeding from coarser levels of granularity to finer ones, and focus on algorithms with linear scaling in model queries. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and context-grounded question answering. The results show that our framework can provide more locally faithful explanations of generated outputs.
- [1934] arXiv:2403.14460 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Towards Single-System Illusion in Software-Defined Vehicles -- Automated, AI-Powered WorkflowKrzysztof Lebioda , Viktor Vorobev , Nenad Petrovic , Fengjunjie Pan , Vahid Zolfaghari , Alois KnollSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: We propose a novel model- and feature-based approach to development of vehicle software systems, where the end architecture is not explicitly defined. Instead, it emerges from an iterative process of search and optimization given certain constraints, requirements and hardware architecture, while retaining the property of single-system illusion, where applications run in a logically uniform environment. One of the key points of the presented approach is the inclusion of modern generative AI, specifically Large Language Models (LLMs), in the loop. With the recent advances in the field, we expect that the LLMs will be able to assist in processing of requirements, generation of formal system models, as well as generation of software deployment specification and test code. The resulting pipeline is automated to a large extent, with feedback being generated at each step.
- [1935] arXiv:2403.14468 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing TasksComments: preprintSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract: Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g. InstructPix2Pix, InstantID, etc) to modify the first frame, (2) utilizing an existing image-to-video generation model (e.g. I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tools to support an extensive array of video editing tasks. Beyond the traditional prompt-based editing methods, AnyV2V also can support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video models to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video. On the prompt-based editing, we show that AnyV2V can outperform the previous best approach by 35\% on prompt alignment, and 25\% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate the fast-evolving image editing methods. Such compatibility can help AnyV2V to increase its versatility to cater to diverse user demands.
- [1936] arXiv:2403.14469 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: ChatGPT Alternative Solutions: Large Language Models SurveyJournal-ref: David C. Wyld et al. (Eds): NBIoT, MLCL, NMCO, ARIN, CSITA, ISPR, NATAP-2024. pp. 153-173, 2024. CS & IT - CSCP 2024Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: In recent times, the grandeur of Large Language Models (LLMs) has not only shone in the realm of natural language processing but has also cast its brilliance across a vast array of applications. This remarkable display of LLM capabilities has ignited a surge in research contributions within this domain, spanning a diverse spectrum of topics. These contributions encompass advancements in neural network architecture, context length enhancements, model alignment, training datasets, benchmarking, efficiency improvements, and more. Recent years have witnessed a dynamic synergy between academia and industry, propelling the field of LLM research to new heights. A notable milestone in this journey is the introduction of ChatGPT, a powerful AI chatbot grounded in LLMs, which has garnered widespread societal attention. The evolving technology of LLMs has begun to reshape the landscape of the entire AI community, promising a revolutionary shift in the way we create and employ AI algorithms. Given this swift-paced technical evolution, our survey embarks on a journey to encapsulate the recent strides made in the world of LLMs. Through an exploration of the background, key discoveries, and prevailing methodologies, we offer an up-to-the-minute review of the literature. By examining multiple LLM models, our paper not only presents a comprehensive overview but also charts a course that identifies existing challenges and points toward potential future research trajectories. This survey furnishes a well-rounded perspective on the current state of generative AI, shedding light on opportunities for further exploration, enhancement, and innovation.
- [1937] arXiv:2403.14472 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Detoxifying Large Language Models via Knowledge EditingMengru Wang , Ningyu Zhang , Ziwen Xu , Zekun Xi , Shumin Deng , Yunzhi Yao , Qishen Zhang , Linyi Yang , Jindong Wang , Huajun ChenComments: Ongoing work. Project website: this https URL Add and update experimental results in Tables 1 and 3Subjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxifying approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at this https URL .
- [1938] arXiv:2403.14483 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Utilizing the LightGBM Algorithm for Operator User Credit Assessment ResearchSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
Abstract: Mobile Internet user credit assessment is an important way for communication operators to establish decisions and formulate measures, and it is also a guarantee for operators to obtain expected benefits. However, credit evaluation methods have long been monopolized by financial industries such as banks and credit. As supporters and providers of platform network technology and network resources, communication operators are also builders and maintainers of communication networks. Internet data improves the user's credit evaluation strategy. This paper uses the massive data provided by communication operators to carry out research on the operator's user credit evaluation model based on the fusion LightGBM algorithm. First, for the massive data related to user evaluation provided by operators, key features are extracted by data preprocessing and feature engineering methods, and a multi-dimensional feature set with statistical significance is constructed; then, linear regression, decision tree, LightGBM, and other machine learning algorithms build multiple basic models to find the best basic model; finally, integrates Averaging, Voting, Blending, Stacking and other integrated algorithms to refine multiple fusion models, and finally establish the most suitable fusion model for operator user evaluation.
- [1939] arXiv:2403.14484 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: HyperGALE: ASD Classification via Hypergraph Gated Attention with Learnable HyperedgesComments: Accepted to IJCNN 2024Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Abstract: Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by varied social cognitive challenges and repetitive behavioral patterns. Identifying reliable brain imaging-based biomarkers for ASD has been a persistent challenge due to the spectrum's diverse symptomatology. Existing baselines in the field have made significant strides in this direction, yet there remains room for improvement in both performance and interpretability. We propose \emph{HyperGALE}, which builds upon the hypergraph by incorporating learned hyperedges and gated attention mechanisms. This approach has led to substantial improvements in the model's ability to interpret complex brain graph data, offering deeper insights into ASD biomarker characterization. Evaluated on the extensive ABIDE II dataset, \emph{HyperGALE} not only improves interpretability but also demonstrates statistically significant enhancements in key performance metrics compared to both previous baselines and the foundational hypergraph model. The advancement \emph{HyperGALE} brings to ASD research highlights the potential of sophisticated graph-based techniques in neurodevelopmental studies. The source code and implementation instructions are available at GitHub: this https URL .
- [1940] arXiv:2403.14488 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Physics-Based Causal Reasoning for Safe & Robust Next-Best Action Selection in Robot Manipulation TasksComments: 8 pages, 9 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)Subjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Abstract: Safe and efficient object manipulation is a key enabler of many real-world robot applications. However, this is challenging because robot operation must be robust to a range of sensor and actuator uncertainties. In this paper, we present a physics-informed causal-inference-based framework for a robot to probabilistically reason about candidate actions in a block stacking task in a partially observable setting. We integrate a physics-based simulation of the rigid-body system dynamics with a causal Bayesian network (CBN) formulation to define a causal generative probabilistic model of the robot decision-making process. Using simulation-based Monte Carlo experiments, we demonstrate our framework's ability to successfully: (1) predict block tower stability with high accuracy (Pred Acc: 88.6%); and, (2) select an approximate next-best action for the block stacking task, for execution by an integrated robot system, achieving 94.2% task success rate. We also demonstrate our framework's suitability for real-world robot systems by demonstrating successful task executions with a domestic support robot, with perception and manipulation sub-system integration. Hence, we show that by embedding physics-based causal reasoning into robots' decision-making processes, we can make robot task execution safer, more reliable, and more robust to various types of uncertainty.
- [1941] arXiv:2403.14494 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Learning to Project for Cross-Task Knowledge DistillationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.
- [1942] arXiv:2403.14496 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: How Human-Centered Explainable AI Interface Are Designed and Evaluated: A Systematic SurveySubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Despite its technological breakthroughs, eXplainable Artificial Intelligence (XAI) research has limited success in producing the {\em effective explanations} needed by users. In order to improve XAI systems' usability, practical interpretability, and efficacy for real users, the emerging area of {\em Explainable Interfaces} (EIs) focuses on the user interface and user experience design aspects of XAI. This paper presents a systematic survey of 53 publications to identify current trends in human-XAI interaction and promising directions for EI design and development. This is among the first systematic survey of EI research.
- [1943] arXiv:2403.14504 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Soft Learning Probabilistic CircuitsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Probabilistic Circuits (PCs) are prominent tractable probabilistic models, allowing for a range of exact inferences. This paper focuses on the main algorithm for training PCs, LearnSPN, a gold standard due to its efficiency, performance, and ease of use, in particular for tabular data. We show that LearnSPN is a greedy likelihood maximizer under mild assumptions. While inferences in PCs may use the entire circuit structure for processing queries, LearnSPN applies a hard method for learning them, propagating at each sum node a data point through one and only one of the children/edges as in a hard clustering process. We propose a new learning procedure named SoftLearn, that induces a PC using a soft clustering process. We investigate the effect of this learning-inference compatibility in PCs. Our experiments show that SoftLearn outperforms LearnSPN in many situations, yielding better likelihoods and arguably better samples. We also analyze comparable tractable models to highlight the differences between soft/hard learning and model querying.
- [1944] arXiv:2403.14508 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Constrained Reinforcement Learning with Smoothed Log Barrier FunctionSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract: Reinforcement Learning (RL) has been widely applied to many control tasks and substantially improved the performances compared to conventional control methods in many domains where the reward function is well defined. However, for many real-world problems, it is often more convenient to formulate optimization problems in terms of rewards and constraints simultaneously. Optimizing such constrained problems via reward shaping can be difficult as it requires tedious manual tuning of reward functions with several interacting terms. Recent formulations which include constraints mostly require a pre-training phase, which often needs human expertise to collect data or assumes having a sub-optimal policy readily available. We propose a new constrained RL method called CSAC-LB (Constrained Soft Actor-Critic with Log Barrier Function), which achieves competitive performance without any pre-training by applying a linear smoothed log barrier function to an additional safety critic. It implements an adaptive penalty for policy learning and alleviates the numerical issues that are known to complicate the application of the log barrier function method. As a result, we show that with CSAC-LB, we achieve state-of-the-art performance on several constrained control tasks with different levels of difficulty and evaluate our methods in a locomotion task on a real quadruped robot platform.
- [1945] arXiv:2403.14526 (cross-list from cs.RO) [ pdf , ps , html , other ]
-
Title: Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion DescriptorsComments: 8 pages, 4 figuresSubjects: Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Precise manipulation that is generalizable across scenes and objects remains a persistent challenge in robotics. Current approaches for this task heavily depend on having a significant number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object. We require no manual grasping demonstrations as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantic-aware robotics manipulation. Web page: this https URL
- [1946] arXiv:2403.14539 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Object-Centric Domain Randomization for 3D Shape Reconstruction in the WildComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: One of the biggest challenges in single-view 3D shape reconstruction in the wild is the scarcity of <3D shape, 2D image>-paired data from real-world environments. Inspired by remarkable achievements via domain randomization, we propose ObjectDR which synthesizes such paired data via a random simulation of visual variations in object appearances and backgrounds. Our data synthesis framework exploits a conditional generative model (e.g., ControlNet) to generate images conforming to spatial conditions such as 2.5D sketches, which are obtainable through a rendering process of 3D shapes from object collections (e.g., Objaverse-XL). To simulate diverse variations while preserving object silhouettes embedded in spatial conditions, we also introduce a disentangled framework which leverages an initial object guidance. After synthesizing a wide range of data, we pre-train a model on them so that it learns to capture a domain-invariant geometry prior which is consistent across various domains. We validate its effectiveness by substantially improving 3D shape reconstruction models on a real-world benchmark. In a scale-up evaluation, our pre-training achieves 23.6% superior results compared with the pre-training on high-quality computer graphics renderings.
- [1947] arXiv:2403.14550 (cross-list from cs.HC) [ pdf , ps , html , other ]
-
Title: Dynamic Explanation Emphasis in Human-XAI Interaction with Communication RobotSubjects: Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI)
Abstract: Communication robots have the potential to contribute to effective human-XAI interaction as an interface that goes beyond textual or graphical explanations. One of their strengths is that they can use physical and vocal expressions to add detailed nuances to explanations. However, it is not clear how a robot can apply such expressions, or in particular, how we can develop a strategy to adaptively use such expressions depending on the task and user in dynamic interactions. To address this question, this paper proposes DynEmph, a method for a communication robot to decide where to emphasize XAI-generated explanations with physical expressions. It predicts the effect of emphasizing certain points on a user and aims to minimize the expected difference between predicted user decisions and AI-suggested ones. DynEmph features a strategy for deciding where to emphasize in a data-driven manner, relieving engineers from the need to manually design a strategy. We further conducted experiments to investigate how emphasis selection strategies affect the performance of user decisions. The results suggest that, while a naive strategy (emphasizing explanations for an AI's most probable class) does not necessarily work better, DynEmph effectively guides users to better decisions under the condition that the performance of the AI suggestion is high.
- [1948] arXiv:2403.14551 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Lexicon-Level Contrastive Visual-Grounding Improves Language ModelingSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
- [1949] arXiv:2403.14562 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: The Era of Semantic DecodingComments: 25 pages, 3 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Abstract: Recent work demonstrated great promise in the idea of orchestrating collaborations between LLMs, human input, and various tools to address the inherent limitations of LLMs. We propose a novel perspective called semantic decoding, which frames these collaborative processes as optimization procedures in semantic space. Specifically, we conceptualize LLMs as semantic processors that manipulate meaningful pieces of information that we call semantic tokens (known thoughts). LLMs are among a large pool of other semantic processors, including humans and tools, such as search engines or code executors. Collectively, semantic processors engage in dynamic exchanges of semantic tokens to progressively construct high-utility outputs. We refer to these orchestrated interactions among semantic processors, optimizing and searching in semantic space, as semantic decoding algorithms. This concept draws a direct parallel to the well-studied problem of syntactic decoding, which involves crafting algorithms to best exploit auto-regressive language models for extracting high-utility sequences of syntactic tokens. By focusing on the semantic level and disregarding syntactic details, we gain a fresh perspective on the engineering of AI systems, enabling us to imagine systems with much greater complexity and capabilities. In this position paper, we formalize the transition from syntactic to semantic tokens as well as the analogy between syntactic and semantic decoding. Subsequently, we explore the possibilities of optimizing within the space of semantic tokens via semantic decoding algorithms. We conclude with a list of research opportunities and questions arising from this fresh perspective. The semantic decoding perspective offers a powerful abstraction for search and optimization directly in the space of meaningful concepts, with semantic tokens as the fundamental units of a new type of computation.
- [1950] arXiv:2403.14578 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical DomainWilliam James Bolton , Rafael Poyiadzi , Edward R. Morrell , Gabriela van Bergen Gonzalez Bueno , Lea GoetzComments: Published at ICLR 2024 Workshop on Reliable and Responsible Foundation ModelsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Large Language Models (LLMs) increasingly support applications in a wide range of domains, some with potential high societal impact such as biomedicine, yet their reliability in realistic use cases is under-researched. In this work we introduce the Reliability AssesMent for Biomedical LLM Assistants (RAmBLA) framework and evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain. We identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case. We design shortform tasks and tasks requiring LLM freeform responses mimicking real-world user interactions. We evaluate LLM performance using semantic similarity with a ground truth response, through an evaluator LLM.
- [1951] arXiv:2403.14582 (cross-list from cs.CL) [ pdf , ps , html , other ]
-
Title: Large Language Models for Multi-Choice Question Classification of Medical SubjectsSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
Abstract: The aim of this paper is to evaluate whether large language models trained on multi-choice question data can be used to discriminate between medical subjects. This is an important and challenging task for automatic question answering. To achieve this goal, we train deep neural networks for multi-class classification of questions into the inferred medical subjects. Using our Multi-Question (MQ) Sequence-BERT method, we outperform the state-of-the-art results on the MedMCQA dataset with an accuracy of 0.68 and 0.60 on their development and test sets, respectively. In this sense, we show the capability of AI and LLMs in particular for multi-classification tasks in the Healthcare domain.
- [1952] arXiv:2403.14592 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: Envisioning the Next-Generation AI Coding Assistants: Insights & ProposalsSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.
- [1953] arXiv:2403.14606 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: The Elements of Differentiable ProgrammingComments: Draft version 1Subjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Abstract: Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.
- [1954] arXiv:2403.14617 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion InversionComments: Project page at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
- [1955] arXiv:2403.14624 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?Renrui Zhang , Dongzhi Jiang , Yichi Zhang , Haokun Lin , Ziyu Guo , Pengshuo Qiu , Aojun Zhou , Pan Lu , Kai-Wei Chang , Peng Gao , Hongsheng LiComments: 46 Pages, Work in Progress, Benchmark Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: this https URL
- [1956] arXiv:2403.14633 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Born With a Silver Spoon? Investigating Socioeconomic Bias in Large Language ModelsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Socioeconomic bias in society exacerbates disparities, influencing access to opportunities and resources based on individuals' economic and social backgrounds. This pervasive issue perpetuates systemic inequalities, hindering the pursuit of inclusive progress as a society. In this paper, we investigate the presence of socioeconomic bias, if any, in large language models. To this end, we introduce a novel dataset SilverSpoon, consisting of 3000 samples that illustrate hypothetical scenarios that involve underprivileged people performing ethically ambiguous actions due to their circumstances, and ask whether the action is ethically justified. Further, this dataset has a dual-labeling scheme and has been annotated by people belonging to both ends of the socioeconomic spectrum. Using SilverSpoon, we evaluate the degree of socioeconomic bias expressed in large language models and the variation of this degree as a function of model size. We also perform qualitative analysis to analyze the nature of this bias. Our analysis reveals that while humans disagree on which situations require empathy toward the underprivileged, most large language models are unable to empathize with the socioeconomically underprivileged regardless of the situation. To foster further research in this domain, we make SilverSpoon and our evaluation harness publicly available.
- [1957] arXiv:2403.14635 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: AI Sustainability in Practice Part One: Foundations for Sustainable AI ProjectsDavid Leslie , Cami Rincon , Morgan Briggs , Antonella Perini , Smera Jayadeva , Ann Borda , SJ Bennett , Christopher Burr , Mhairi Aitken , Michael Katell , Claudia Fischer , Janis Wong , Ismael Kherroubi GarciaSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Sustainable AI projects are continuously responsive to the transformative effects as well as short-, medium-, and long-term impacts on individuals and society that the design, development, and deployment of AI technologies may have. Projects, which centre AI Sustainability, ensure that values-led, collaborative, and anticipatory reflection both guides the assessment of potential social and ethical impacts and steers responsible innovation practices.
This workbook is the first part of a pair that provides the concepts and tools needed to put AI Sustainability into practice. It introduces the SUM Values, which help AI project teams to assess the potential societal impacts and ethical permissibility of their projects. It then presents a Stakeholder Engagement Process (SEP), which provides tools to facilitate proportionate engagement of and input from stakeholders with an emphasis on equitable and meaningful participation and positionality awareness. - [1958] arXiv:2403.14636 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: AI Fairness in PracticeDavid Leslie , Cami Rincon , Morgan Briggs , Antonella Perini , Smera Jayadeva , Ann Borda , SJ Bennett , Christopher Burr , Mhairi Aitken , Michael Katell , Claudia Fischer , Janis Wong , Ismael Kherroubi GarciaSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract: Reaching consensus on a commonly accepted definition of AI Fairness has long been a central challenge in AI ethics and governance. There is a broad spectrum of views across society on what the concept of fairness means and how it should best be put to practice. In this workbook, we tackle this challenge by exploring how a context-based and society-centred approach to understanding AI Fairness can help project teams better identify, mitigate, and manage the many ways that unfair bias and discrimination can crop up across the AI project workflow.
We begin by exploring how, despite the plurality of understandings about the meaning of fairness, priorities of equality and non-discrimination have come to constitute the broadly accepted core of its application as a practical principle. We focus on how these priorities manifest in the form of equal protection from direct and indirect discrimination and from discriminatory harassment. These elements form ethical and legal criteria based upon which instances of unfair bias and discrimination can be identified and mitigated across the AI project workflow.
We then take a deeper dive into how the different contexts of the AI project lifecycle give rise to different fairness concerns. This allows us to identify several types of AI Fairness (Data Fairness, Application Fairness, Model Design and Development Fairness, Metric-Based Fairness, System Implementation Fairness, and Ecosystem Fairness) that form the basis of a multi-lens approach to bias identification, mitigation, and management. Building on this, we discuss how to put the principle of AI Fairness into practice across the AI project workflow through Bias Self-Assessment and Bias Risk Management as well as through the documentation of metric-based fairness criteria in a Fairness Position Statement. - [1959] arXiv:2403.14639 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: On Defining Smart Cities using Transformer Neural NetworksComments: 16 pages, 2 fuguresJournal-ref: International Journal of Computer and Technology Vol 24 (2024) ISSN: 2277-3061Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Cities worldwide are rapidly adopting smart technologies, transforming urban life. Despite this trend, a universally accepted definition of 'smart city' remains elusive. Past efforts to define it have not yielded a consensus, as evidenced by the numerous definitions in use. In this paper, we endeavored to create a new 'compromise' definition that should resonate with most experts previously involved in defining this concept and aimed to validate one of the existing definitions. We reviewed 60 definitions of smart cities from industry, academia, and various relevant organizations, employing transformer architecture-based generative AI and semantic text analysis to reach this compromise. We proposed a semantic similarity measure as an evaluation technique, which could generally be used to compare different smart city definitions, assessing their uniqueness or resemblance. Our methodology employed generative AI to analyze various existing definitions of smart cities, generating a list of potential new composite definitions. Each of these new definitions was then tested against the pre-existing individual definitions we have gathered, using cosine similarity as our metric. This process identified smart city definitions with the highest average cosine similarity, semantically positioning them as the closest on average to all the 60 individual definitions selected.
- [1960] arXiv:2403.14641 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairnessDavid Fernández Llorca , Ronan Hamon , Henrik Junklewitz , Kathrin Grosse , Lars Kunze , Patrick Seiniger , Robert Swaim , Nick Reed , Alexandre Alahi , Emilia Gómez , Ignacio Sánchez , Akos KristonComments: 44 pages, 8 figures, submitted to a peer-review journalSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: This study explores the complexities of integrating Artificial Intelligence (AI) into Autonomous Vehicles (AVs), examining the challenges introduced by AI components and the impact on testing procedures, focusing on some of the essential requirements for trustworthy AI. Topics addressed include the role of AI at various operational layers of AVs, the implications of the EU's AI Act on AVs, and the need for new testing methodologies for Advanced Driver Assistance Systems (ADAS) and Automated Driving Systems (ADS). The study also provides a detailed analysis on the importance of cybersecurity audits, the need for explainability in AI decision-making processes and protocols for assessing the robustness and ethical behaviour of predictive systems in AVs. The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology, highlighting the need for multidisciplinary expertise.
- [1961] arXiv:2403.14642 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Revolutionising Distance Learning: A Comparative Study of Learning Progress with AI-Driven TutoringMoritz Möller , Gargi Nirmal , Dario Fabietti , Quintus Stierstorfer , Mark Zakhvatkin , Holger Sommerfeld , Sven SchüttSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Generative AI is expected to have a vast, positive impact on education; however, at present, this potential has not yet been demonstrated at scale at university level. In this study, we present first evidence that generative AI can increase the speed of learning substantially in university students. We tested whether using the AI-powered teaching assistant Syntea affected the speed of learning of hundreds of distance learning students across more than 40 courses at the IU International University of Applied Sciences. Our analysis suggests that using Syntea reduced their study time substantially--by about 27\% on average--in the third month after the release of Syntea. Taken together, the magnitude of the effect and the scalability of the approach implicate generative AI as a key lever to significantly improve and accelerate learning by personalisation.
- [1962] arXiv:2403.14643 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Exploring ChatGPT and its Impact on SocietyComments: 13 PagesJournal-ref: AI and Ethics (2024)Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: Artificial intelligence has been around for a while, but suddenly it has received more attention than ever before. Thanks to innovations from companies like Google, Microsoft, Meta, and other major brands in technology. OpenAI, though, has triggered the button with its ground-breaking invention ChatGPT. ChatGPT is a Large Language Model (LLM) based on Transformer architecture that has the ability to generate human-like responses in a conversational context. It uses deep learning algorithms to generate natural language responses to input text. Its large number of parameters, contextual generation, and open-domain training make it a versatile and effective tool for a wide range of applications, from chatbots to customer service to language translation. It has the potential to revolutionize various industries and transform the way we interact with technology. However, the use of ChatGPT has also raised several concerns, including ethical, social, and employment challenges, which must be carefully considered to ensure the responsible use of this technology. The article provides an overview of ChatGPT, delving into its architecture and training process. It highlights the potential impacts of ChatGPT on the society. In this paper, we suggest some approaches involving technology, regulation, education, and ethics in an effort to maximize ChatGPT's benefits while minimizing its negative impacts. This study is expected to contribute to a greater understanding of ChatGPT and aid in predicting the potential changes it may bring about.
- [1963] arXiv:2403.14645 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Designing Multi-Step Action Models for Enterprise AI AdoptionComments: 8 pages, 5 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: This paper introduces the Multi-Step Action Model (MSAM), a closed-source AI model designed by Empsing to address challenges hindering AI adoption in enterprises. Through a holistic examination, this paper explores MSAM's foundational principles, design architecture, and future trajectory. It evaluates MSAM's performance via rigorous testing methodologies and envisions its potential impact on advancing AI adoption within organizations.
- [1964] arXiv:2403.14650 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Harnessing the Computing Continuum across Personalized Healthcare, Maintenance and Inspection, and Farming 4.0Fatemeh Baghdadi , Davide Cirillo , Daniele Lezzi , Francesc Lordan , Fernando Vazquez , Eugenio Lomurno , Alberto Archetti , Danilo Ardagna , Matteo MatteucciSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The AI-SPRINT project, launched in 2021 and funded by the European Commission, focuses on the development and implementation of AI applications across the computing continuum. This continuum ensures the coherent integration of computational resources and services from centralized data centers to edge devices, facilitating efficient and adaptive computation and application delivery. AI-SPRINT has achieved significant scientific advances, including streamlined processes, improved efficiency, and the ability to operate in real time, as evidenced by three practical use cases. This paper provides an in-depth examination of these applications -- Personalized Healthcare, Maintenance and Inspection, and Farming 4.0 -- highlighting their practical implementation and the objectives achieved with the integration of AI-SPRINT technologies. We analyze how the proposed toolchain effectively addresses a range of challenges and refines processes, discussing its relevance and impact in multiple domains. After a comprehensive overview of the main AI-SPRINT tools used in these scenarios, the paper summarizes of the findings and key lessons learned.
- [1965] arXiv:2403.14652 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: MemeCraft: Contextual and Stance-Driven Multimodal Meme GenerationComments: 8 pages, 7 figures, ACM MM 2024Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Abstract: Online memes have emerged as powerful digital cultural artifacts in the age of social media, offering not only humor but also platforms for political discourse, social critique, and information dissemination. Their extensive reach and influence in shaping online communities' sentiments make them invaluable tools for campaigning and promoting ideologies. Despite the development of several meme-generation tools, there remains a gap in their systematic evaluation and their ability to effectively communicate ideologies. Addressing this, we introduce MemeCraft, an innovative meme generator that leverages large language models (LLMs) and visual language models (VLMs) to produce memes advocating specific social movements. MemeCraft presents an end-to-end pipeline, transforming user prompts into compelling multimodal memes without manual intervention. Conscious of the misuse potential in creating divisive content, an intrinsic safety mechanism is embedded to curb hateful meme production.
- [1966] arXiv:2403.14658 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Identifying Potential Inlets of Man in the Artificial Intelligence Development ProcessComments: Published in CSCW '23 Conference Proceedings. 7 pages, 1 figureSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: In this paper we hope to identify how the typical or standard artificial intelligence development process encourages or facilitates the creation of racialized technologies. We begin by understanding Sylvia Wynter's definition of the biocentric Man genre and its exclusion of Blackness from humanness. We follow this with outlining what we consider to be the typical steps for developing an AI-based technology, which we have broken down into 6 stages: identifying a problem, development process and management tool selection, dataset development and data processing, model development, deployment and risk assessment, and integration and monitoring. The goal of this paper is to better understand how Wynter's biocentric Man is being represented and reinforced by the technologies we are producing in the AI lifecycle and by the lifecycle itself; we hope to identify ways in which the distinction of Blackness from the "ideal" human leads to perpetual punishment at the hands of these technologies. By deconstructing this development process, we can potentially identify ways in which humans in general have not been prioritized and how those affects are disproportionately affecting marginalized people. We hope to offer solutions that will encourage changes in the AI development cycle.
- [1967] arXiv:2403.14659 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Social Intelligence Data Infrastructure: Structuring the Present and Navigating the FutureSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: As Natural Language Processing (NLP) systems become increasingly integrated into human social life, these technologies will need to increasingly rely on social intelligence. Although there are many valuable datasets that benchmark isolated dimensions of social intelligence, there does not yet exist any body of work to join these threads into a cohesive subfield in which researchers can quickly identify research gaps and future directions. Towards this goal, we build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets. Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects. Our analyses demonstrate its utility in enabling a thorough understanding of current data landscape and providing a holistic perspective on potential directions for future dataset development. We show there is a need for multifaceted datasets, increased diversity in language and culture, more long-tailed social situations, and more interactive data in future social intelligence data efforts.
- [1968] arXiv:2403.14660 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Machina Economicus: A New Paradigm for Prosumers in the Energy Internet of Smart CitiesComments: 25 pages, 1 figureSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Energy Internet (EI) is emerging as new share economy platform for flexible local energy supplies in smart cities. Empowered by the Internet-of-Things (IoT) and Artificial Intelligence (AI), EI aims to unlock peer-to-peer energy trading and sharing among prosumers, who can adeptly switch roles between providers and consumers in localized energy markets with rooftop photovoltaic panels, vehicle-to-everything technologies, packetized energy management, etc. The integration of prosumers in EI, however, will encounter many challenges in modelling, analyzing, and designing an efficient, economic, and social-optimal platform for energy sharing, calling for advanced AI/IoT-based solutions to resource optimization, information exchange, and interaction protocols in the context of the share economy. In this study, we aim to introduce a recently emerged paradigm, Machina Economicus, to investigate the economic rationality in modelling, analysis, and optimization of AI/IoT-based EI prosumer behaviors. The new paradigm, built upon the theory of machine learning and mechanism design, will offer new angles to investigate the selfishness of AI through a game-theoretic perspective, revealing potential competition and collaborations resulting from the self-adaptive learning and decision-making capacity. This study will focus on how the introduction of AI will reshape prosumer behaviors on the EI, and how this paradigm will reveal new research questions and directions when AI meets the share economy. With an extensive case analysis in the literature, we will also shed light on potential solutions for advancements of AI in future smart cities.
- [1969] arXiv:2403.14662 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Case Studies of AI Policy Development in AfricaKadijatou Diallo , Jonathan Smith , Chinasa T. Okolo , Dorcas Nyamwaya , Jonas Kgomo , Richard NgamitaSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Artificial Intelligence (AI) requires new ways of evaluating national technology use and strategy for African nations. We conduct a survey of existing 'readiness' assessments both for general digital adoption and for AI policy in particular. We conclude that existing global readiness assessments do not fully capture African states' progress in AI readiness and lay the groundwork for how assessments can be better used for the African context. We consider the extent to which these indicators map to the African context and what these indicators miss in capturing African states' on-the-ground work in meeting AI capability. Through case studies of four African nations of diverse geographic and economic dimensions, we identify nuances missed by global assessments and offer high-level policy considerations for how states can best improve their AI readiness standards and prepare their societies to capture the benefits of AI.
- [1970] arXiv:2403.14668 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Predicting Learning Performance with Large Language Models: A Study in Adult LiteracyLiang Zhang , Jionghao Lin , Conrad Borchers , John Sabatini , John Hollander , Meng Cao , Xiangen HuComments: 26TH International Conference on Human-Computer InteractionSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: Intelligent Tutoring Systems (ITSs) have significantly enhanced adult literacy training, a key factor for societal participation, employment opportunities, and lifelong learning. Our study investigates the application of advanced AI models, including Large Language Models (LLMs) like GPT-4, for predicting learning performance in adult literacy programs in ITSs. This research is motivated by the potential of LLMs to predict learning performance based on its inherent reasoning and computational capabilities. By using reading comprehension datasets from the ITS, AutoTutor, we evaluate the predictive capabilities of GPT-4 versus traditional machine learning methods in predicting learning performance through five-fold cross-validation techniques. Our findings show that the GPT-4 presents the competitive predictive abilities with traditional machine learning methods such as Bayesian Knowledge Tracing, Performance Factor Analysis, Sparse Factor Analysis Lite (SPARFA-Lite), tensor factorization and eXtreme Gradient Boosting (XGBoost). While XGBoost (trained on local machine) outperforms GPT-4 in predictive accuracy, GPT-4-selected XGBoost and its subsequent tuning on the GPT-4 platform demonstrates superior performance compared to local machine execution. Moreover, our investigation into hyper-parameter tuning by GPT-4 versus grid-search suggests comparable performance, albeit with less stability in the automated approach, using XGBoost as the case study. Our study contributes to the field by highlighting the potential of integrating LLMs with traditional machine learning models to enhance predictive accuracy and personalize adult literacy education, setting a foundation for future research in applying LLMs within ITSs.
- [1971] arXiv:2403.14676 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Unified Uncertainty Estimation for Cognitive Diagnosis ModelsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Cognitive diagnosis models have been widely used in different areas, especially intelligent education, to measure users' proficiency levels on knowledge concepts, based on which users can get personalized instructions. As the measurement is not always reliable due to the weak links of the models and data, the uncertainty of measurement also offers important information for decisions. However, the research on the uncertainty estimation lags behind that on advanced model structures for cognitive diagnosis. Existing approaches have limited efficiency and leave an academic blank for sophisticated models which have interaction function parameters (e.g., deep learning-based models). To address these problems, we propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models. Specifically, based on the idea of estimating the posterior distributions of cognitive diagnosis model parameters, we first provide a unified objective function for mini-batch based optimization that can be more efficiently applied to a wide range of models and large datasets. Then, we modify the reparameterization approach in order to adapt to parameters defined on different domains. Furthermore, we decompose the uncertainty of diagnostic parameters into data aspect and model aspect, which better explains the source of uncertainty. Extensive experiments demonstrate that our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
- [1972] arXiv:2403.14680 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Trust in AI: Progress, Challenges, and Future DirectionsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The increasing use of artificial intelligence (AI) systems in our daily life through various applications, services, and products explains the significance of trust/distrust in AI from a user perspective. AI-driven systems (as opposed to other technologies) have ubiquitously diffused in our life not only as some beneficial tools to be used by human agents but also are going to be substitutive agents on our behalf, or manipulative minds that would influence human thought, decision, and agency. Trust/distrust in AI plays the role of a regulator and could significantly control the level of this diffusion, as trust can increase, and distrust may reduce the rate of adoption of AI. Recently, varieties of studies have paid attention to the variant dimension of trust/distrust in AI, and its relevant considerations. In this systematic literature review, after conceptualization of trust in the current AI literature review, we will investigate trust in different types of human-Machine interaction, and its impact on technology acceptance in different domains. In addition to that, we propose a taxonomy of technical (i.e., safety, accuracy, robustness) and non-technical axiological (i.e., ethical, legal, and mixed) trustworthiness metrics, and some trustworthy measurements. Moreover, we examine some major trust-breakers in AI (e.g., autonomy and dignity threat), and trust makers; and propose some future directions and probable solutions for the transition to a trustworthy AI.
- [1973] arXiv:2403.14681 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: AI Ethics: A Bibliometric Analysis, Critical Issues, and Key GapsDi Kevin Gao (1,2), Andrew Haverly (1), Sudip Mittal (1), Jiming Wu (2), Jingdao Chen (1) ((1) Mississippi State University, (2) California State University - East Bay)Journal-ref: International Journal of Business Analytics (IJBAN), 2024, 11(1), 1-19Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Artificial intelligence (AI) ethics has emerged as a burgeoning yet pivotal area of scholarly research. This study conducts a comprehensive bibliometric analysis of the AI ethics literature over the past two decades. The analysis reveals a discernible tripartite progression, characterized by an incubation phase, followed by a subsequent phase focused on imbuing AI with human-like attributes, culminating in a third phase emphasizing the development of human-centric AI systems. After that, they present seven key AI ethics issues, encompassing the Collingridge dilemma, the AI status debate, challenges associated with AI transparency and explainability, privacy protection complications, considerations of justice and fairness, concerns about algocracy and human enfeeblement, and the issue of superintelligence. Finally, they identify two notable research gaps in AI ethics regarding the large ethics model (LEM) and AI identification and extend an invitation for further scholarly research.
- [1974] arXiv:2403.14682 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Deep Generative Domain Adaptation with Temporal Relation Knowledge for Cross-User Activity RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: In human activity recognition (HAR), the assumption that training and testing data are independent and identically distributed (i.i.d.) often fails, particularly in cross-user scenarios where data distributions vary significantly. This discrepancy highlights the limitations of conventional domain adaptation methods in HAR, which typically overlook the inherent temporal relations in time-series data. To bridge this gap, our study introduces a Conditional Variational Autoencoder with Universal Sequence Mapping (CVAE-USM) approach, which addresses the unique challenges of time-series domain adaptation in HAR by relaxing the i.i.d. assumption and leveraging temporal relations to align data distributions effectively across different users. This method combines the strengths of Variational Autoencoder (VAE) and Universal Sequence Mapping (USM) to capture and utilize common temporal patterns between users for improved activity recognition. Our results, evaluated on two public HAR datasets (OPPT and PAMAP2), demonstrate that CVAE-USM outperforms existing state-of-the-art methods, offering a more accurate and generalizable solution for cross-user activity recognition.
- [1975] arXiv:2403.14683 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: A Moral Imperative: The Need for Continual Superalignment of Large Language ModelsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: This paper examines the challenges associated with achieving life-long superalignment in AI systems, particularly large language models (LLMs). Superalignment is a theoretical framework that aspires to ensure that superintelligent AI systems act in accordance with human values and goals. Despite its promising vision, we argue that achieving superalignment requires substantial changes in the current LLM architectures due to their inherent limitations in comprehending and adapting to the dynamic nature of these human ethics and evolving global scenarios. We dissect the challenges of encoding an ever-changing spectrum of human values into LLMs, highlighting the discrepancies between static AI models and the dynamic nature of human societies. To illustrate these challenges, we analyze two distinct examples: one demonstrates a qualitative shift in human values, while the other presents a quantifiable change. Through these examples, we illustrate how LLMs, constrained by their training data, fail to align with contemporary human values and scenarios. The paper concludes by exploring potential strategies to address and possibly mitigate these alignment discrepancies, suggesting a path forward in the pursuit of more adaptable and responsive AI systems.
- [1976] arXiv:2403.14685 (cross-list from cs.LG) [ pdf , ps , other ]
-
Title: Cyclical Log Annealing as a Learning Rate SchedulerSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: A learning rate scheduler is a predefined set of instructions for varying search stepsizes during model training processes. This paper introduces a new logarithmic method using harsh restarting of step sizes through stochastic gradient descent. Cyclical log annealing implements the restart pattern more aggressively to maybe allow the usage of more greedy algorithms on the online convex optimization framework. The algorithm was tested on the CIFAR-10 image datasets, and seemed to perform analogously with cosine annealing on large transformer-enhanced residual neural networks. Future experiments would involve testing the scheduler in generative adversarial networks and finding the best parameters for the scheduler with more experiments.
- [1977] arXiv:2403.14687 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: On the Performance of Imputation Techniques for Missing Values on Healthcare DatasetsSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI)
Abstract: Missing values or data is one popular characteristic of real-world datasets, especially healthcare data. This could be frustrating when using machine learning algorithms on such datasets, simply because most machine learning models perform poorly in the presence of missing values. The aim of this study is to compare the performance of seven imputation techniques, namely Mean imputation, Median Imputation, Last Observation carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, Missforest imputation, and Multiple imputation by Chained Equations (MICE), on three healthcare datasets. Some percentage of missing values - 10\%, 15\%, 20\% and 25\% - were introduced into the dataset, and the imputation techniques were employed to impute these missing values. The comparison of their performance was evaluated by using root mean squared error (RMSE) and mean absolute error (MAE). The results show that Missforest imputation performs the best followed by MICE imputation. Additionally, we try to determine whether it is better to perform feature selection before imputation or vice versa by using the following metrics - the recall, precision, f1-score and accuracy. Due to the fact that there are few literature on this and some debate on the subject among researchers, we hope that the results from this experiment will encourage data scientists and researchers to perform imputation first before feature selection when dealing with data containing missing values.
- [1978] arXiv:2403.14689 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Developing and Deploying Industry Standards for Artificial Intelligence in Education (AIED): Challenges, Strategies, and Future DirectionsComments: 12 pagesSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: The adoption of Artificial Intelligence in Education (AIED) holds the promise of revolutionizing educational practices by offering personalized learning experiences, automating administrative and pedagogical tasks, and reducing the cost of content creation. However, the lack of standardized practices in the development and deployment of AIED solutions has led to fragmented ecosystems, which presents challenges in interoperability, scalability, and ethical governance. This article aims to address the critical need to develop and implement industry standards in AIED, offering a comprehensive analysis of the current landscape, challenges, and strategic approaches to overcome these obstacles. We begin by examining the various applications of AIED in various educational settings and identify key areas lacking in standardization, including system interoperability, ontology mapping, data integration, evaluation, and ethical governance. Then, we propose a multi-tiered framework for establishing robust industry standards for AIED. In addition, we discuss methodologies for the iterative development and deployment of standards, incorporating feedback loops from real-world applications to refine and adapt standards over time. The paper also highlights the role of emerging technologies and pedagogical theories in shaping future standards for AIED. Finally, we outline a strategic roadmap for stakeholders to implement these standards, fostering a cohesive and ethical AIED ecosystem. By establishing comprehensive industry standards, such as those by IEEE Artificial Intelligence Standards Committee (AISC) and International Organization for Standardization (ISO), we can accelerate and scale AIED solutions to improve educational outcomes, ensuring that technological advances align with the principles of inclusivity, fairness, and educational excellence.
- [1979] arXiv:2403.14690 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Incorporating Graph Attention Mechanism into Geometric Problem Solving Based on Deep Reinforcement LearningXiuqin Zhong , Shengyuan Yan , Gongqi Lin , Hongguang Fu , Liang Xu , Siwen Jiang , Lei Huang , Wei FangSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract: In the context of online education, designing an automatic solver for geometric problems has been considered a crucial step towards general math Artificial Intelligence (AI), empowered by natural language understanding and traditional logical inference. In most instances, problems are addressed by adding auxiliary components such as lines or points. However, adding auxiliary components automatically is challenging due to the complexity in selecting suitable auxiliary components especially when pivotal decisions have to be made. The state-of-the-art performance has been achieved by exhausting all possible strategies from the category library to identify the one with the maximum likelihood. However, an extensive strategy search have to be applied to trade accuracy for ef-ficiency. To add auxiliary components automatically and efficiently, we present deep reinforcement learning framework based on the language model, such as BERT. We firstly apply the graph attention mechanism to reduce the strategy searching space, called AttnStrategy, which only focus on the conclusion-related components. Meanwhile, a novel algorithm, named Automatically Adding Auxiliary Components using Reinforcement Learning framework (A3C-RL), is proposed by forcing an agent to select top strategies, which incorporates the AttnStrategy and BERT as the memory components. Results from extensive experiments show that the proposed A3C-RL algorithm can substantially enhance the average precision by 32.7% compared to the traditional MCTS. In addition, the A3C-RL algorithm outperforms humans on the geometric questions from the annual University Entrance Mathematical Examination of China.
- [1980] arXiv:2403.14691 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Healthcare ProfessionalsComments: 1 figureSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: This paper explores the evolving relationship between clinician trust in LLMs, the transformation of data sources from predominantly human-generated to AI-generated content, and the subsequent impact on the precision of LLMs and clinician competence. One of the primary concerns identified is the potential feedback loop that arises as LLMs become more reliant on their outputs for learning, which may lead to a degradation in output quality and a reduction in clinician skills due to decreased engagement with fundamental diagnostic processes. While theoretical at this stage, this feedback loop poses a significant challenge as the integration of LLMs in healthcare deepens, emphasizing the need for proactive dialogue and strategic measures to ensure the safe and effective use of LLM technology. A key takeaway from our investigation is the critical role of user expertise and the necessity for a discerning approach to trusting and validating LLM outputs. The paper highlights how expert users, particularly clinicians, can leverage LLMs to enhance productivity by offloading routine tasks while maintaining a critical oversight to identify and correct potential inaccuracies in AI-generated content. This balance of trust and skepticism is vital for ensuring that LLMs augment rather than undermine the quality of patient care. Moreover, we delve into the potential risks associated with LLMs' self-referential learning loops and the deskilling of healthcare professionals. The risk of LLMs operating within an echo chamber, where AI-generated content feeds into the learning algorithms, threatens the diversity and quality of the data pool, potentially entrenching biases and reducing the efficacy of LLMs.
- [1981] arXiv:2403.14692 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: The AI Assessment Scale (AIAS) in action: A pilot implementation of GenAI supported assessmentSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: The rapid adoption of Generative Artificial Intelligence (GenAI) technologies in higher education has raised concerns about academic integrity, assessment practices, and student learning. Banning or blocking GenAI tools has proven ineffective, and punitive approaches ignore the potential benefits of these technologies. This paper presents the findings of a pilot study conducted at British University Vietnam (BUV) exploring the implementation of the Artificial Intelligence Assessment Scale (AIAS), a flexible framework for incorporating GenAI into educational assessments. The AIAS consists of five levels, ranging from 'No AI' to 'Full AI', enabling educators to design assessments that focus on areas requiring human input and critical thinking.
Following the implementation of the AIAS, the pilot study results indicate a significant reduction in academic misconduct cases related to GenAI, a 5.9% increase in student attainment across the university, and a 33.3% increase in module passing rates. The AIAS facilitated a shift in pedagogical practices, with faculty members incorporating GenAI tools into their modules and students producing innovative multimodal submissions. The findings suggest that the AIAS can support the effective integration of GenAI in HE, promoting academic integrity while leveraging the technology's potential to enhance learning experiences. - [1982] arXiv:2403.14693 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: A2CI: A Cloud-based, Service-oriented Geospatial Cyberinfrastructure to Support Atmospheric ResearchSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Abstract: Big earth science data offers the scientific community great opportunities. Many more studies at large-scales, over long-terms and at high resolution can now be conducted using the rich information collected by remote sensing satellites, ground-based sensor networks, and even social media input. However, the hundreds of terabytes of information collected and compiled on an hourly basis by NASA and other government agencies present a significant challenge for atmospheric scientists seeking to improve the understanding of the Earth atmospheric system. These challenges include effective discovery, organization, analysis and visualization of large amounts of data. This paper reports the outcomes of an NSF-funded project that developed a geospatial cyberinfrastructure -- the A2CI (Atmospheric Analysis Cyberinfrastructure) -- to support atmospheric research. We first introduce the service-oriented system framework then describe in detail the implementation of the data discovery module, data management module, data integration module, data analysis and visualization modules following the cloud computing principles-Data-as-a-Service, Software-as-a-Service, Platform-as-a-Service and Infrastructure-as-a-Service. We demonstrate the graphic user interface by performing an analysis between Sea Surface Temperature and the intensity of tropical storms in the North Atlantic and Pacific oceans. We expect this work to contribute to the technical advancement of cyberinfrastructure research as well as to the development of an online, collaborative scientific analysis system for atmospheric science.
- [1983] arXiv:2403.14694 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Application of GPT Language Models for Innovation in Activities in University TeachingComments: 15 pages, in spanish language, 4 tables, 5 figuresSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abstract: The GPT (Generative Pre-trained Transformer) language models are an artificial intelligence and natural language processing technology that enables automatic text generation. There is a growing interest in applying GPT language models to university teaching in various dimensions. From the perspective of innovation in student and teacher activities, they can provide support in understanding and generating content, problem-solving, as well as personalization and test correction, among others. From the dimension of internationalization, the misuse of these models represents a global problem that requires taking a series of common measures in universities from different geographical areas. In several countries, there has been a review of assessment tools to ensure that work is done by students and not by AI. To this end, we have conducted a detailed experiment in a representative subject of Computer Science such as Software Engineering, which has focused on evaluating the use of ChatGPT as an assistant in theory activities, exercises, and laboratory practices, assessing its potential use as a support tool for both students and teachers.
- [1984] arXiv:2403.14697 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: An AIC-based approach for articulating unpredictable problems in open complex environmentsComments: S. Bernardi, T. Zoppi (Editors), "Fast Abstracts and Student Forum Proceedings - EDCC 2024 - 19th European Dependable Computing Conference, Leuven, Belgium, 8-11 April 2024"Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Abstract: This research paper presents an approach to enhancing the predictive capability of architects in the design and assurance of systems, focusing on systems operating in dynamic and unpredictable environments. By adopting a systems approach, we aim to improve architects' predictive capabilities in designing dependable systems (for example, ML-based systems). An aerospace case study is used to illustrate the approach. Multiple factors (challenges) influencing aircraft detection are identified, demonstrating the effectiveness of our approach in a complex operational setting. Our approach primarily aimed to enhance the architect's predictive capability.
- [1985] arXiv:2403.14704 (cross-list from cs.LO) [ pdf , ps , html , other ]
-
Title: A minimal coalition logicSubjects: Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI)
Abstract: Coalition logic is a central logic in strategic reasoning studies. In this paper, we first argue that Coalition Logic models, concurrent game models, have three too-strong assumptions. The first one is the independence of agents; that is, the merge of two available joint actions of two disjoint coalitions is always available for the union of the two coalitions. The second one is seriality; that is, coalitions always have available joint actions. The third one is determinism, that is, the grand coalition's joint actions always have a unique outcome. Second, we present a coalition logic based on general concurrent game models, which do not have the three assumptions. We show the completeness of this logic and compare it with Coalition Logic in detail. This logic seems minimal in the context of strategic reasoning.
- [1986] arXiv:2403.14706 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Safeguarding Marketing Research: The Generation, Identification, and Mitigation of AI-Fabricated DisinformationSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Generative AI has ushered in the ability to generate content that closely mimics human contributions, introducing an unprecedented threat: Deployed en masse, these models can be used to manipulate public opinion and distort perceptions, resulting in a decline in trust towards digital platforms. This study contributes to marketing literature and practice in three ways. First, it demonstrates the proficiency of AI in fabricating disinformative user-generated content (UGC) that mimics the form of authentic content. Second, it quantifies the disruptive impact of such UGC on marketing research, highlighting the susceptibility of analytics frameworks to even minimal levels of disinformation. Third, it proposes and evaluates advanced detection frameworks, revealing that standard techniques are insufficient for filtering out AI-generated disinformation. We advocate for a comprehensive approach to safeguarding marketing research that integrates advanced algorithmic solutions, enhanced human oversight, and a reevaluation of regulatory and ethical frameworks. Our study seeks to serve as a catalyst, providing a foundation for future research and policy-making aimed at navigating the intricate challenges at the nexus of technology, ethics, and marketing.
- [1987] arXiv:2403.14710 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Use of recommendation models to provide support to dyslexic studentsGianluca Morciano , José Manuel Alcalde-Llergo , Andrea Zingoni , Enrique Yeguas-Bolivar , Juri Taborri , Giuseppe CalabròComments: 36 pages, 4 figures and 6 tables. Preprint submitted to Expert Systems with ApplicationsJournal-ref: Expert Systems with Applications Volume 249, Part C, 1 September 2024, 123738Subjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: Dyslexia is the most widespread specific learning disorder and significantly impair different cognitive domains. This, in turn, negatively affects dyslexic students during their learning path. Therefore, specific support must be given to these students. In addition, such a support must be highly personalized, since the problems generated by the disorder can be very different from one to another. In this work, we explored the possibility of using AI to suggest the most suitable supporting tools for dyslexic students, so as to provide a targeted help that can be of real utility. To do this, we relied on recommendation algorithms, which are a branch of machine learning, that aim to detect personal preferences and provide the most suitable suggestions. We hence implemented and trained three collaborative-filtering recommendation models, namely an item-based, a user-based and a weighted-hybrid model, and studied their performance on a large database of 1237 students' information, collected with a self-evaluating questionnaire regarding all the most used supporting strategies and digital tools. Each recommendation model was tested with three different similarity metrics, namely Pearson correlation, Euclidean distance and Cosine similarity. The obtained results showed that a recommendation system is highly effective in suggesting the optimal help tools/strategies for everyone. This demonstrates that the proposed approach is successful and can be used as a new and effective methodology to support students with dyslexia.
- [1988] arXiv:2403.14711 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: Human-in-the-Loop AI for Cheating Ring DetectionComments: Accepted to the AI4Ed Workshop at AAAI 2024 as a short paperSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Online exams have become popular in recent years due to their accessibility. However, some concerns have been raised about the security of the online exams, particularly in the context of professional cheating services aiding malicious test takers in passing exams, forming so-called "cheating rings". In this paper, we introduce a human-in-the-loop AI cheating ring detection system designed to detect and deter these cheating rings. We outline the underlying logic of this human-in-the-loop AI system, exploring its design principles tailored to achieve its objectives of detecting cheaters. Moreover, we illustrate the methodologies used to evaluate its performance and fairness, aiming to mitigate the unintended risks associated with the AI system. The design and development of the system adhere to Responsible AI (RAI) standards, ensuring that ethical considerations are integrated throughout the entire development process.
- [1989] arXiv:2403.14712 (cross-list from cs.CY) [ pdf , ps , html , other ]
-
Title: AI for bureaucratic productivity: Measuring the potential of AI to help automate 143 million UK government transactionsVincent J. Straub , Youmna Hashem , Jonathan Bright , Satyam Bhagwanani , Deborah Morgan , John Francis , Saba Esnaashari , Helen MargettsSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: There is currently considerable excitement within government about the potential of artificial intelligence to improve public service productivity through the automation of complex but repetitive bureaucratic tasks, freeing up the time of skilled staff. Here, we explore the size of this opportunity, by mapping out the scale of citizen-facing bureaucratic decision-making procedures within UK central government, and measuring their potential for AI-driven automation. We estimate that UK central government conducts approximately one billion citizen-facing transactions per year in the provision of around 400 services, of which approximately 143 million are complex repetitive transactions. We estimate that 84% of these complex transactions are highly automatable, representing a huge potential opportunity: saving even an average of just one minute per complex transaction would save the equivalent of approximately 1,200 person-years of work every year. We also develop a model to estimate the volume of transactions a government service undertakes, providing a way for government to avoid conducting time consuming transaction volume measurements. Finally, we find that there is high turnover in the types of services government provide, meaning that automation efforts should focus on general procedures rather than services themselves which are likely to evolve over time. Overall, our work presents a novel perspective on the structure and functioning of modern government, and how it might evolve in the age of artificial intelligence.
- [1990] arXiv:2403.14715 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Understanding Why Label Smoothing Degrades Selective Classification and How to Fix ItSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: Label smoothing (LS) is a popular regularisation method for training deep neural network classifiers due to its effectiveness in improving test accuracy and its simplicity in implementation. "Hard" one-hot labels are "smoothed" by uniformly distributing probability mass to other classes, reducing overfitting. In this work, we reveal that LS negatively affects selective classification (SC) - where the aim is to reject misclassifications using a model's predictive uncertainty. We first demonstrate empirically across a range of tasks and architectures that LS leads to a consistent degradation in SC. We then explain this by analysing logit-level gradients, showing that LS exacerbates overconfidence and underconfidence by regularising the max logit more when the probability of error is low, and less when the probability of error is high. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of logit normalisation for recovering lost SC performance caused by LS. Furthermore, based on our gradient analysis, we explain why such normalisation is effective. We will release our code shortly.
- [1991] arXiv:2403.14734 (cross-list from cs.SE) [ pdf , ps , html , other ]
-
Title: A Survey of Neural Code Intelligence: Paradigms, Advances and BeyondQiushi Sun , Zhirui Chen , Fangzhi Xu , Kanzhi Cheng , Chang Ma , Zhangyue Yin , Jianing Wang , Chengcheng Han , Renyu Zhu , Shuai Yuan , Qipeng Guo , Xipeng Qiu , Pengcheng Yin , Xiaoli Li , Fei Yuan , Lingpeng Kong , Xiang Li , Zhiyong WuComments: 64 pages, 6 figures, 10 tables, 688 referencesSubjects: Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Abstract: Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at this https URL .
- [1992] arXiv:2403.14736 (cross-list from q-bio.QM) [ pdf , ps , html , other ]
-
Title: NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural NetworksSubjects: Quantitative Methods (q-bio.QM) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract: Protein classification tasks are essential in drug discovery. Real-world protein structures are dynamic, which will determine the properties of proteins. However, the existing machine learning methods, like ProNet (Wang et al., 2022a), only access limited conformational characteristics and protein side-chain features, leading to impractical protein structure and inaccuracy of protein classes in their predictions. In this paper, we propose novel semantic data augmentation methods, Novel Augmentation of New Node Attributes (NaNa), and Molecular Interactions and Geometric Upgrading (MiGu) to incorporate backbone chemical and side-chain biophysical information into protein classification tasks and a co-embedding residual learning framework. Specifically, we leverage molecular biophysical, secondary structure, chemical bonds, and ionic features of proteins to facilitate protein classification tasks. Furthermore, our semantic augmentation methods and the co-embedding residual learning framework can improve the performance of GIN (Xu et al., 2019) on EC and Fold datasets (Bairoch, 2000; Andreeva et al., 2007) by 16.41% and 11.33% respectively. Our code is available at this https URL .
- [1993] arXiv:2403.14763 (cross-list from hep-th) [ pdf , ps , html , other ]
-
Title: Gravitational Duals from Equations of StateYago Bea , Raul Jimenez , David Mateos , Shuheng Liu , Pavlos Protopapas , Pedro Tarancón-Álvarez , Pablo Tejerina-PérezSubjects: High Energy Physics - Theory (hep-th) ; Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
Abstract: Holography relates gravitational theories in five dimensions to four-dimensional quantum field theories in flat space. Under this map, the equation of state of the field theory is encoded in the black hole solutions of the gravitational theory. Solving the five-dimensional Einstein's equations to determine the equation of state is an algorithmic, direct problem. Determining the gravitational theory that gives rise to a prescribed equation of state is a much more challenging, inverse problem. We present a novel approach to solve this problem based on physics-informed neural networks. The resulting algorithm is not only data-driven but also informed by the physics of the Einstein's equations. We successfully apply it to theories with crossovers, first- and second-order phase transitions.
- [1994] arXiv:2403.14772 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Improving Robustness to Model Inversion Attacks via Sparse Coding ArchitecturesComments: 32 pages, 15 Tables, and 9 FiguresSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract: Recent model inversion attack algorithms permit adversaries to reconstruct a neural network's private training data just by repeatedly querying the network and inspecting its outputs. In this work, we develop a novel network architecture that leverages sparse-coding layers to obtain superior robustness to this class of attacks. Three decades of computer science research has studied sparse coding in the context of image denoising, object recognition, and adversarial misclassification settings, but to the best of our knowledge, its connection to state-of-the-art privacy vulnerabilities remains unstudied. However, sparse coding architectures suggest an advantageous means to defend against model inversion attacks because they allow us to control the amount of irrelevant private information encoded in a network's intermediate representations in a manner that can be computed efficiently during training and that is known to have little effect on classification accuracy. Specifically, compared to networks trained with a variety of state-of-the-art defenses, our sparse-coding architectures maintain comparable or higher classification accuracy while degrading state-of-the-art training data reconstructions by factors of 1.1 to 18.3 across a variety of reconstruction quality metrics (PSNR, SSIM, FID). This performance advantage holds across 5 datasets ranging from CelebA faces to medical images and CIFAR-10, and across various state-of-the-art SGD-based and GAN-based inversion attacks, including Plug-&-Play attacks. We provide a cluster-ready PyTorch codebase to promote research and standardize defense evaluations.
- [1995] arXiv:2403.14773 (cross-list from cs.CV) [ pdf , ps , other ]
-
Title: StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from TextRoberto Henschel , Levon Khachatryan , Daniil Hayrapetyan , Hayk Poghosyan , Vahram Tadevosyan , Zhangyang Wang , Shant Navasardyan , Humphrey ShiComments: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Abstract: Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video generation (typically 16 or 24 frames), ending up with hard-cuts when naively extended to the case of long video synthesis. To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. The key components are:(i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the previous chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module, which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that enables to apply a video enhancer autoregressively for infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates high motion amount. In contrast, all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. Thus, we propose with StreamingT2V a high-quality seamless text-to-long video generator that outperforms competitors with consistency and motion. Our code will be available at: this https URL
- [1996] arXiv:2403.14783 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question AnsweringComments: A full version of the paper will be released soon. The codes are available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract: This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
- [1997] arXiv:2403.14790 (cross-list from cs.CV) [ pdf , ps , html , other ]
-
Title: Latent Diffusion Models for Attribute-Preserving Image AnonymizationSubjects: Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI)
Abstract: Generative techniques for image anonymization have great potential to generate datasets that protect the privacy of those depicted in the images, while achieving high data fidelity and utility. Existing methods have focused extensively on preserving facial attributes, but failed to embrace a more comprehensive perspective that considers the scene and background into the anonymization process. This paper presents, to the best of our knowledge, the first approach to image anonymization based on Latent Diffusion Models (LDMs). Every element of a scene is maintained to convey the same meaning, yet manipulated in a way that makes re-identification difficult. We propose two LDMs for this purpose: CAMOUFLaGE-Base exploits a combination of pre-trained ControlNets, and a new controlling mechanism designed to increase the distance between the real and anonymized images. CAMOFULaGE-Light is based on the Adapter technique, coupled with an encoding designed to efficiently represent the attributes of different persons in a scene. The former solution achieves superior performance on most metrics and benchmarks, while the latter cuts the inference time in half at the cost of fine-tuning a lightweight module. We show through extensive experimental comparison that the proposed method is competitive with the state-of-the-art concerning identity obfuscation whilst better preserving the original content of the image and tackling unresolved challenges that current solutions fail to address.
- [1998] arXiv:2403.14791 (cross-list from cs.CY) [ pdf , ps , other ]
-
Title: Particip-AI: A Democratic Surveying Framework for Anticipating Future AI Use Cases, Harms and BenefitsJimin Mun , Liwei Jiang , Jenny Liang , Inyoung Cheong , Nicole DeCario , Yejin Choi , Tadayoshi Kohno , Maarten SapComments: 35 pages, 4 figures, 23 tablesSubjects: Computers and Society (cs.CY) ; Artificial Intelligence (cs.AI)
Abstract: General purpose AI, such as ChatGPT, seems to have lowered the barriers for the public to use AI and harness its power. However, the governance and development of AI still remain in the hands of a few, and the pace of development is accelerating without proper assessment of risks. As a first step towards democratic governance and risk assessment of AI, we introduce Particip-AI, a framework to gather current and future AI use cases and their harms and benefits from non-expert public. Our framework allows us to study more nuanced and detailed public opinions on AI through collecting use cases, surfacing diverse harms through risk assessment under alternate scenarios (i.e., developing and not developing a use case), and illuminating tensions over AI development through making a concluding choice on its development. To showcase the promise of our framework towards guiding democratic AI, we gather responses from 295 demographically diverse participants. We find that participants' responses emphasize applications for personal life and society, contrasting with most current AI development's business focus. This shows the value of surfacing diverse harms that are complementary to expert assessments. Furthermore, we found that perceived impact of not developing use cases predicted participants' judgements of whether AI use cases should be developed, and highlighted lay users' concerns of techno-solutionism. We conclude with a discussion on how frameworks like Particip-AI can further guide democratic AI governance and regulation.
- [1999] arXiv:2403.14800 (cross-list from cs.LG) [ pdf , ps , html , other ]
-
Title: Deep Active Learning: A Reality CheckSubjects: Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract: We conduct a comprehensive evaluation of state-of-the-art deep active learning methods. Surprisingly, under general settings, no single-model method decisively outperforms entropy-based active learning, and some even fall short of random sampling. We delve into overlooked aspects like starting budget, budget step, and pretraining's impact, revealing their significance in achieving superior results. Additionally, we extend our evaluation to other tasks, exploring the active learning effectiveness in combination with semi-supervised learning, and object detection. Our experiments provide valuable insights and concrete recommendations for future active learning studies. By uncovering the limitations of current methods and understanding the impact of different experimental settings, we aim to inspire more efficient training of deep learning models in real-world scenarios with limited annotation budgets. This work contributes to advancing active learning's efficacy in deep learning and empowers researchers to make informed decisions when applying active learning to their tasks.
- [2000] arXiv:2403.14814 (cross-list from cs.CL) [ pdf , ps , other ]
-
Title: The opportunities and risks of large language models in mental healthHannah R. Lawrence , Renee A. Schneider , Susan B. Rubin , Maja J. Mataric , Daniel J. McDuff , Megan Jones BellComments: 12 pages, 2 tables, 4 figuresSubjects: Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Abstract: Global rates of mental health concerns are rising and there is increasing realization that existing models of mental healthcare will not adequately expand to meet the demand. With the emergence of large language models (LLMs) has come great optimism regarding their promise to create novel, large-scale solutions to support mental health. Despite their nascence, LLMs have already been applied to mental health-related tasks. In this review, we summarize the extant literature on efforts to use LLMs to provide mental health education, assessment, and intervention and highlight key opportunities for positive impact in each area. We then highlight risks associated with LLMs application to mental health and encourage adoption of strategies to mitigate these risks. The urgent need for mental health support must be balanced with responsible development, testing, and deployment of mental health LLMs. Especially critical is ensuring that mental health LLMs are fine-tuned for mental health, enhance mental health equity, adhere to ethical standards, and that people, including those with lived experience with mental health concerns, are involved in all stages from development through deployment. Prioritizing these efforts will minimize potential harms to mental health and maximize the likelihood that LLMs will positively impact mental health globally.
- [2001] arXiv:2403.14817 (cross-list from eess.AS) [ pdf , ps , html , other ]
-
Title: Crowdsourced Multilingual Speech Intelligibility TestingSubjects: Audio and Speech Processing (eess.AS) ; Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract: With the advent of generative audio features, there is an increasing need for rapid evaluation of their impact on speech intelligibility. Beyond the existing laboratory measures, which are expensive and do not scale well, there has been comparatively little work on crowdsourced assessment of intelligibility. Standards and recommendations are yet to be defined, and publicly available multilingual test materials are lacking. In response to this challenge, we propose an approach for a crowdsourced intelligibility assessment. We detail the test design, the collection and public release of the multilingual speech data, and the results of our early experiments.
Total of 2552 entries :
2-2001
2001-2552